# Drive-by Machine Learning Updates

A Whirlwind Tour of the Past Few Weeks of AI/ML Learning

Things are getting busy: the spring semester is ending and the summer semester is beginning in my M.S. Health Data Science program. Since my last Cornell-specific update, I’ve finished two more classes and have just one to go, which starts next week. That, plus some exciting related news at work (more later, perhaps), has meant a lot of time learning and not as much time available for writing, which is a quality problem to have, indeed! So in case you’re interested, this is a quick summary of what I have done and learned in the last few weeks.

# Making Decision Trees Work in the Real World

Decision Trees on their own turn out to be pretty terrible machine learning algorithms due to a propensity for high bias or high variance. At a high level:

Variance occurs when a decision tree gets too specific — i.e. goes too deep — on learning the training data set, which means you get really high accuracy on your training data but when you introduce new data, it no longer performs well. A commonly used term for this is “overfitting.”

Bias occurs when you don’t get specific enough — i.e. your decision tree doesn’t go deep enough, so the model can’t get good results even on the training data. This is commonly referred to as “underfitting.”

You can have both bias and variance present at the same time, so how do you address this? For variance, Random Forest is a popular approach. The idea is to train many trees on slightly different versions of your data and then average their predictions. Since most of the time you’re already using everything you have access to, you get these extra datasets by bootstrapping: you build child datasets of the same size as your original data by pulling out a random observation, adding it to the child, then putting it back into the data and pulling out another random observation (that is, sampling with replacement). You end up with a bunch of similar datasets which will all have randomly duplicated and omitted observations. Furthermore, at each split in a tree you only evaluate a random subset of the variables (for classification, usually about the square root of the total number). Each individual tree may well have even more variance than a single tree trained on the original data would have. However, when you train dozens, hundreds, or thousands of these trees and average the results (training on bootstrap samples and then aggregating is called bootstrap aggregation, or “bagging”), you end up with a model that has significantly **less** variance. Since Decision Trees are fast to build and evaluate, this ends up being a fantastic approach in high variance situations.
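To make the bootstrap-and-average idea concrete, here’s a minimal sketch in plain Python rather than a real Random Forest library. The toy data, the one-variable tree, and names like `fit_tree` and `fit_bagged_trees` are my own assumptions for illustration, and because there is only one input variable, the per-split variable subsampling is left out.

```python
import math
import random

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data, sampling with replacement."""
    return [rng.choice(data) for _ in data]

def sse(points):
    """Sum of squared errors of the y values around their mean."""
    if not points:
        return 0.0
    mean = sum(y for _, y in points) / len(points)
    return sum((y - mean) ** 2 for _, y in points)

def fit_tree(points, depth):
    """Greedily fit a 1-D regression tree; returns a predict(x) function."""
    mean = sum(y for _, y in points) / len(points)
    xs = sorted({x for x, _ in points})
    if depth == 0 or len(xs) < 2:
        return lambda x: mean  # leaf: predict the average label
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighbouring x values
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t, left, right)
    _, t, left, right = best
    left_tree = fit_tree(left, depth - 1)
    right_tree = fit_tree(right, depth - 1)
    return lambda x: left_tree(x) if x <= t else right_tree(x)

def fit_bagged_trees(points, n_trees=50, depth=6, seed=0):
    """Train one tree per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    trees = [fit_tree(bootstrap_sample(points, rng), depth) for _ in range(n_trees)]
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

# Toy data (made up for this sketch): a noisy sine wave.
rng = random.Random(42)
data = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.3)) for i in range(40)]

# A bootstrap sample is the same size as the data but repeats some rows.
sample = bootstrap_sample(data, random.Random(1))
print(len(sample), "drawn,", len(set(sample)), "distinct")

single_tree = fit_tree(data, depth=6)
forest = fit_bagged_trees(data)
print("single tree at x=1.55:", single_tree(1.55))
print("bagged trees at x=1.55:", forest(1.55))
```

The deep single tree tends to chase the noise in whichever points it sees, while the averaged prediction smooths over those quirks, which is the variance reduction described above.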

If bias is your issue, your Decision Tree is known as a “weak learner”: it’s just not very accurate on its own. Boosting is the method you use to correct for it, and in particular, Gradient Boosted Regression Trees are a popular approach. Think of this as an iterative approach that is similar to Logistic Regression in that it uses gradient descent to find the optimal solution. You start with a high-bias decision tree (so limited depth) and you compute the difference between the actual label in your training data and the predicted label. This is called the “residual.” You then train another Decision Tree on the same data, but instead of trying to predict the original label, you are now trying to predict the residual. You add that tree’s predictions to your overall model, scaled by a small number (for example, 0.1 or 0.05), recompute the residuals against the updated model, and repeat. In principle you could continue until the residuals are all zeros, which would mean you have an accurate model for your training data and have mitigated the bias. In practice you usually stop after a fixed number of rounds or once the residuals stop shrinking meaningfully, since chasing zero training error invites overfitting.
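Here’s a rough sketch of that residual-fitting loop in plain Python, assuming squared-error loss (where the residual is exactly the negative gradient) and using depth-1 trees (“stumps”) on a single input variable as the weak learner. The toy data and the names `fit_stump` and `gradient_boost` are hypothetical, chosen only for illustration.

```python
def fit_stump(xs, targets):
    """Depth-1 regression tree on 1-D inputs: one threshold, two leaf means."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        a, b = xs[order[k - 1]], xs[order[k]]
        if a == b:
            continue  # cannot split between identical x values
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, (a + b) / 2, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def gradient_boost(xs, ys, n_rounds=100, learning_rate=0.1):
    """Start from the mean label, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step toward the residuals instead of a full correction.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (made up for this sketch): a simple curve a single stump cannot fit.
xs = [i / 10 for i in range(30)]
ys = [x ** 2 for x in xs]

for rounds in (0, 1, 10, 100):
    model = gradient_boost(xs, ys, n_rounds=rounds)
    print(rounds, "rounds, training MSE:", mse(model, xs, ys))
```

Each round only nudges the model by `learning_rate` times the new stump’s prediction, so the training error shrinks gradually as rounds are added rather than all at once.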

Using boosting and bagging to mitigate bias and variance in Decision Trees makes them extremely powerful machine learning algorithms which, again, have the advantage of being relatively quick and inexpensive to train and evaluate.

# Support Vector Machines: The Perceptron Gets a Serious Upgrade

If you remember back to the Perceptron, it was a great early effort at building a machine learning predictive model. However, there are two big drawbacks. The first is that the algorithm assumes a separating hyperplane exists for your data points. If this ends up not being true, either because of mislabeled data or other anomalies, the Perceptron will never converge on a solution. The other drawback is that the Perceptron stops as soon as it finds a hyperplane that separates the data correctly. There is no way to evaluate whether that hyperplane is the **best possible** one.
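Both drawbacks are easy to see in a small sketch. The version below is a plain-Python Perceptron with an epoch cap that I added so the loop terminates even when no separating hyperplane exists; the function name and the two toy label sets (AND, which is separable, and XOR, which is not) are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def train_perceptron(points, labels, max_epochs=1000):
    """Return (w, b, converged). Labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (dot(w, x) + b) <= 0:  # wrong side of (or on) the hyperplane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            # Stops at the first separating hyperplane it finds, good or not.
            return w, b, True
    return w, b, False  # gave up: the data may not be linearly separable

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]   # linearly separable
xor_labels = [-1, 1, 1, -1]    # not linearly separable

print(train_perceptron(points, and_labels))
print(train_perceptron(points, xor_labels))
```

On the AND labels the loop stops as soon as an epoch passes with no mistakes. On the XOR labels it never gets a clean epoch, so without the cap it would run forever.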

Enter Maximum Margin Classifiers (MMCs). In essence, MMCs seek to optimize the separating hyperplane by maximizing the margin: the distance between the hyperplane and the closest point(s) on either side of it. The best hyperplane is the one that makes this distance as large as possible, which also leaves the closest points on each side equally far away. These closest data points are referred to as Support Vectors, and the models built on this idea are called Support Vector Machines, or SVMs.
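As a small illustration of what “margin” means here, the sketch below measures the distance from a hyperplane to its closest training point and compares two hyperplanes that both separate the same toy data. The data and the two candidate hyperplanes are made up for this example; nothing here searches for the optimal hyperplane, it only scores the candidates.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive result means every point is on its correct side.
    """
    norm = math.sqrt(dot(w, w))
    return min(y * (dot(w, x) + b) / norm for x, y in zip(points, labels))

points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

# Two hyperplanes that both classify every training point correctly.
vertical = ((1.0, 0.0), -3.0)   # the line x1 = 3
diagonal = ((1.0, 1.0), -5.5)   # the line x1 + x2 = 5.5

print("vertical margin:", geometric_margin(*vertical, points, labels))
print("diagonal margin:", geometric_margin(*diagonal, points, labels))
```

A Perceptron would be happy with either line, while a maximum margin classifier prefers the diagonal one because its closest points sit farther from the boundary.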

In addition to finding the best possible hyperplane where the data are linearly separable, SVMs also introduce the concept of Slack Variables. This allows you to identify the best possible hyperplane even when the data aren’t linearly separable, meaning you by necessity have to allow some of your training points to fall inside the margin or on the wrong side of the hyperplane. You can control how tight or loose these Slack Variables are (typically through a penalty on the total slack), which determines how the model trades a wider margin against training errors. But in the end, even though you have some training error, you still get a usable hyperplane.
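One common way to write this down is the soft-margin objective, half the squared norm of `w` plus `C` times the total slack, where each point’s slack is `max(0, 1 - y * (w.x + b))`. The sketch below minimizes that objective with plain subgradient descent, which is an assumption on my part and not necessarily how any particular SVM library solves it. Because subgradient steps are not guaranteed to improve the objective every time, it keeps the best iterate seen so far. The toy data, including one deliberately mislabeled point, is made up.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, points, labels):
    """Slack per point: 0 if outside the margin on the correct side."""
    return [max(0.0, 1 - y * (dot(w, x) + b)) for x, y in zip(points, labels)]

def objective(w, b, points, labels, C):
    """Soft-margin objective: margin term plus C times the total slack."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, points, labels))

def train_soft_margin_svm(points, labels, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on the soft-margin objective; returns best (w, b)."""
    w, b = [0.0] * len(points[0]), 0.0
    best = (objective(w, b, points, labels, C), w, b)
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the 0.5 * ||w||^2 term
        for x, y in zip(points, labels):
            if y * (dot(w, x) + b) < 1:  # point has nonzero slack
                grad_w = [g - C * y * xi for g, xi in zip(grad_w, x)]
                grad_b -= C * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
        value = objective(w, b, points, labels, C)
        if value < best[0]:
            best = (value, w, b)
    return best[1], best[2]

# Two clusters plus one mislabeled point, so no hyperplane separates them.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5), (4.5, 4.5)]
labels = [-1, -1, -1, 1, 1, 1, -1]

w, b = train_soft_margin_svm(points, labels, C=1.0)
print("w, b:", w, b)
print("slack per point:", slacks(w, b, points, labels))
```

Raising `C` makes slack more expensive, so the model tolerates fewer violations; lowering it favors a wider margin at the cost of more training error.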

When the relationship in your data isn’t linear, bias ends up being a problem with a linear SVM, which you can overcome in a computationally efficient way by employing “the kernel trick”: implicitly mapping the data into a higher-dimensional space where a separating hyperplane can be found, without ever computing that mapping directly. The math and theory behind this go beyond what I usually write about, so let me know if you’re dying to get into the weeds on it. The interesting and fun thing, though, is that once I’d worked through the math, the actual implementation of a “kernelized” SVM model had a lot in common with the linear SVM code, so actually writing the function to do the work was decently straightforward.
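Without getting into the full theory, the core identity behind the kernel trick fits in a few lines. The sketch below uses a degree-2 polynomial kernel on two-dimensional inputs as an assumed example (other kernels work on the same principle) and checks that the kernel value equals an ordinary dot product in an explicitly expanded feature space.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map whose dot product matches poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
print("kernel in 2-D:        ", poly_kernel(x, z))
print("dot product after phi:", dot(phi(x), phi(z)))
```

The two numbers agree (up to floating-point rounding), but the kernel never builds the expanded vectors. That is why a kernelized SVM can look so much like the linear version: wherever the linear code takes a dot product between data points, the kernelized code calls the kernel function instead.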

# Inferential vs. Predictive Modeling

And finally, in my Master’s Degree program, we are diving into Inferential Modeling. The differences between inferential and predictive modeling are subtle, but important, and both have their place in the ML world. They both use similar tools and data, but whereas the goal of predictive modeling is to find the most accurate model, the goal of inferential modeling is to find the model that has the best fit and to understand what it says about the data. In other words, inferential modeling asks “which variables in my model are predictive, how predictive are they, what effect do they have,” and so on, whereas predictive modeling simply asks “if I throw in these variables, how accurate is my result?” I’m already fascinated by the delineation between the two approaches, and will look forward to providing additional updates as I learn more.
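As a rough illustration of the two mindsets, the sketch below fits the same one-variable linear regression and then asks two different questions of it. The data is made up, the function names are my own, and a simple least-squares line is only one of many models either approach might use.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, standard error of the slope).
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, intercept, slope_se

# Made-up data: y = 2x + 1 plus a small alternating wobble.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 == 0 else -0.5) for x in xs]
train_x, train_y = xs[:15], ys[:15]
test_x, test_y = xs[15:], ys[15:]

slope, intercept, slope_se = fit_line(train_x, train_y)

# Inferential question: what effect does x have, and how sure are we?
print("slope:", slope, "standard error:", slope_se, "t-statistic:", slope / slope_se)

# Predictive question: how accurate is the model on data it has not seen?
test_mse = sum((intercept + slope * x - y) ** 2
               for x, y in zip(test_x, test_y)) / len(test_x)
print("held-out MSE:", test_mse)
```

The model is identical in both cases. What changes is whether you care about the coefficient and its uncertainty or about the error on new data.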

So with that, you’re within about a week of being all caught up. Let me know if there’s anything you’d particularly like my personal take on or a deeper dive into.