Invoking Intentional Inefficiency to Improve Inference
In a previous post I briefly touched on the problem of overfitting, loosely defined as a machine learning model memorizing its training data set — providing high accuracy for predictions against that data, but performing poorly when presented with new data, a phenomenon known as variance. The post discussed how the Random Forest approach uses bootstrap aggregation to address this issue, but it raised the question: “Why does intentionally producing lower-quality data sets and averaging across their results produce better predictions?”
Reality, it turns out, is messy, so intentionally introducing inaccuracy in the process of producing predictions (that’s some impressive alliteration, don’t you think?) usually makes them better. It’s a process known as regularization.
It turns out that all kinds of machine learning algorithms carry overfitting risks, and the way you regularize depends on the model you’re trying to fit. In regression there are two popular regularization approaches, Ridge and Lasso. With Ridge regression, you essentially add a penalty to the values your model provides — i.e. you subtract something from the overall impact of each feature/variable to reduce its impact on the overall prediction. Ridge’s penalty is quadratic (it grows with the square of each coefficient), so the larger the impact a variable has on the model, the greater the regularization impact — i.e. you end up not trusting those super powerful variables as much as your data suggests you should. Lasso, on the other hand, penalizes the absolute size of each coefficient, which ends up shrinking variables with small impacts to the point it drops them from the model altogether. This not only helps regularize the model, but can also be useful as a feature selection exercise for future model building. The one you select would depend on your goals and the data you have available.
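A minimal numpy sketch of that difference (my own illustration, using the closed-form shrinkage each penalty implies for a single update step, not any particular library's implementation): Ridge scales every coefficient toward zero proportionally, while Lasso's soft-thresholding subtracts a fixed amount, so small coefficients hit exactly zero and drop out of the model.

```python
import numpy as np

def ridge_shrink(w, lam):
    # Ridge (L2): scales every coefficient toward zero. Big coefficients
    # lose the most in absolute terms, but nothing ever reaches exactly zero.
    return w / (1.0 + lam)

def lasso_shrink(w, lam):
    # Lasso (L1): soft-thresholding subtracts a constant amount, so any
    # coefficient smaller than lam is set exactly to zero -- which is why
    # Lasso doubles as a feature selection tool.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([5.0, 1.0, 0.2, -0.1])  # one strong feature, two weak ones
print(ridge_shrink(w, 0.5))  # everything shrinks, nothing is dropped
print(lasso_shrink(w, 0.5))  # the two weak features are zeroed out entirely
```

Run with different values of `lam` and you can watch Lasso prune more and more features while Ridge just keeps squeezing them all toward zero.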
With Decision Trees, we’ve already discussed Random Forest. But what about Neural Networks? Well, it turns out there are a *bunch* of ways to introduce messiness there. Neural Networks are complex mechanisms with a bunch of moving parts, and just about any part of the process can be messed with for regularization purposes. For example, Neural Networks move data through a series of nodes and layers, starting with random weights and adjusting them across multiple iterations. While I find many regularization tactics fascinating (randomly dropping nodes in hidden layers, for example), my personal favorite from a “you’ve got to be kidding me” standpoint is to randomly drop most of your data when finding error minimum points using Stochastic Gradient Descent.
Neural Networks and Gradient Descent
Neural Networks use Gradient Descent for the same reasons Logistic Regression uses it — to find a minimum point of error in the data so that you can use it to make predictions. With linear data, this works really well, and the more variables / features you have, the more precise the line you draw can be. This is great when you have linear data, since your model can assume there is a single low error point it needs to find. However, as we saw in the previous post, Neural Networks are often figuring out functions that have many curves. Take this example:
With Neural Networks, you start at a random point and then use Gradient Descent to find the lowest point (the lowest error) you can. Say for example, you start here:
Gradient Descent would flow down the side of the wall you are on to find the lowest **local** minimum, so you’d end up here:
However, the next time the model is built, you might start here:
…and thus end up here:
The first thing to note is that the local minimum you find is almost completely dependent on your starting point. There are roughly four local minimums in this function:
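You can see that start-point dependence in a few lines of Python. The bumpy one-dimensional "error surface" below is made up for illustration (it's not the function in the figure): plain gradient descent started from two different points settles into two different local minimums.

```python
import numpy as np

def f(x):
    # A bumpy toy error surface with several local minimums.
    return np.sin(3 * x) + 0.1 * x**2

def grad_f(x):
    # Its derivative, used to pick the downhill direction.
    return 3 * np.cos(3 * x) + 0.2 * x

def gradient_descent(x0, lr=0.01, steps=2000):
    # Roll downhill from x0; we can only ever reach the local
    # minimum at the bottom of whatever basin we started in.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# Two different starting points, two different answers.
print(gradient_descent(-2.0))
print(gradient_descent(2.0))
```

Both runs end at a point where the gradient is essentially zero, but they are different points — which basin you land in is decided entirely by where you began.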
Leaving things up to random chance and finding a really shallow minimum (like the first one from the left) is definitely a risk. Another risk, however, is finding a really deep minimum (like the second one from the left). Why is this a bad thing? After all, isn’t that technically the point with the lowest error?
This is a challenge because when you run test data through your model, moving just slightly in one direction or another has a big impact on the prediction. Move a small percentage left or right, and your predicted value changes dramatically.
The goal, then, is to find minimums at wide points in the overall function, where small movements don’t result in large predicted changes. The third and fourth minimums from the left would both work nicely for this. But if the minimum you’re going to find is completely dependent on where you happen to randomly start, that’s not going to provide satisfactory results a good percentage of the time.
Stochastic Gradient Descent
Throwing the Marble at The Bowl I Made in 2nd Grade Art
Enter regularization via Stochastic Gradient Descent. Stochastic is just a fancy word for random in this context. Let’s say you have a training data set with 2,000 observations. With traditional Gradient Descent, you use every observation available to calculate the direction of the descent at each step, meaning you’re going to have a really good chance of finding the lowest point closest to you. With Stochastic Gradient Descent, you randomly pick a small batch of those observations — say, just 32 of them — and base your calculation on that tiny fraction of the data. Rather than dropping a marble in a perfectly concave bowl, that’s more like throwing it every time, at an angle, and intentionally making it bounce pretty hard — and maybe not even in the right direction. It’s intentionally and probably egregiously introducing significant error into your accuracy analysis.
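Here's a toy sketch of that mini-batch idea on synthetic data (the 2,000 observations and batches of 32 match the numbers above; everything else is made up for illustration). Each step estimates the gradient from a random batch of 32 samples rather than all 2,000, so every step is a little noisy and "bouncy," yet the weights still wander toward the true values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training set: 2,000 observations of a linear relationship
# with a little noise added.
X = rng.normal(size=(2000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=2000)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for step in range(500):
    # Estimate the gradient from a random mini-batch of 32 observations
    # instead of all 2,000: a noisy direction, but vastly cheaper, and
    # the noise is what bounces us out of shallow or narrow minimums.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
    w -= lr * grad

print(w)  # wanders close to true_w despite the noisy steps
```

Note that the learned weights never settle exactly; they keep jittering around the minimum, which is precisely the bouncing-marble behavior described above.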
What does this buy you? Well, for starters, you’re going to bounce well out of shallow minimum areas pretty quickly, thus avoiding the first minimum in our function. Then, if you do happen to land in or near a deep but narrow minimum, you’re going to pretty quickly bounce your way out of that as well. Stochastic Gradient Descent can only approach a minimum, then, that is both wide enough and deep enough to contain it despite all of the bouncing around. Again, in our example function, either the third or fourth minimum areas would have worked, and while the fourth minimum was a lower absolute minimum than the third, either one would likely result in satisfactory predictions. Run your model a few times and you’ll find both, and then take averages or whatever you want to make the final prediction.
As you can see, this can get pretty complicated, pretty quickly. The bottom line, though, is that when you are building predictive models based on incomplete or imperfect data, you **want** to introduce methods that make them squishier, less sure of themselves, and less efficient at using the data available to them. Regularization is the way you accomplish this, and the methods available for regularizing your models depend on the type of model you’re building. It’s part of the art that goes into data science, and I personally find it fascinating.