Fuzzy Wuzzy Neural Nets
Better Testing Error Gets!
As a kid, I loved the play on words of the “Fuzzy Wuzzy was a bear…” tongue twister (probably a little too much). It struck a chord with me, not only because it was hard to say five times fast, but because of the imagery of a hairless bear (hilarious when I was 7–8 years old) and the irony of the fact that you would name a hairless bear “Fuzzy.” Similarly, while I have discussed the problem of overfitting and noise in data a couple of times before, I still get a kick out of how much better machine learning predictions can be when you introduce “fuzziness” into them via regularization. Intentionally making your predictions worse in training, and as a result, improving the performance of the resulting model on test and real-world data, is just weird. I understand the theory — just like 8-year-old Jason understood that a name didn’t necessarily define you in any physical sense — but I still appreciate the irony.
With neural networks, there are several approaches to regularization. In this post I’ll discuss three of them: stochastic gradient descent, weight decay, and dropout.
Stochastic Gradient Descent
Stochastic gradient descent (SGD) is one of those neat discoveries in data science that not only serves as a form of regularization, but also dramatically speeds up the process of training a model. When you have, say, a billion observations of high-dimensional data that you want to train a model on, performing even relatively simple forward and backward pass calculations can take a lot of time and compute power. As I’ve described before, stochastic gradient descent takes a ridiculously small subset of the total observations (say 64 or so) at random and performs the math just on those data points. Then, at the next iteration, a new set (sometimes referred to as a “batch”) is selected and the process repeats.
Let’s say this visualization represented the entirety of our available data:
Each of these observations gets a prediction value, and then the loss calculated from that prediction vs. the actual value (which we already know because this is training data) is used to compute a gradient. We’ve seen this before; however, now we don’t assume convexity, so there will actually be multiple combinations of weight and bias values that produce an approximately zero-slope sum of gradients (in the picture above, that blue line at the bottom; the goal is to find weight values that make it relatively flat).
As my first post on SGD mentioned, being precise and finding the nearest local minimum (i.e. the set of weight and bias values that produces that flat sum of gradients) isn’t the goal here. You want to find a set of weight and bias values such that the prediction loss remains relatively stable even if the weights and biases shift somewhat in one direction or another. This is what it means to find a “wide local minimum.” SGD is just sloppy enough, then, that it is highly unlikely to converge satisfactorily unless the weight and bias values it finds remain accurate across different subsets of the data.
There are a few different ways to actually implement SGD, but we’ll go with a conceptually simple one, and start by selecting our random batch of data points before performing a forward pass through our neural network.
We run just those data points through the neural network, compute the loss, and calculate our gradients and weight corrections using backpropagation.
Then you repeat the process, only this time with a different batch of observations. Each batch takes a tiny fraction of the time it would have taken to compute the passes using the entire data set, so you get updates a lot faster, and in a lot of cases you will reach convergence without needing to actually perform calculations on every observation.
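The loop described above can be sketched in a few lines of NumPy. This is a toy illustration on a simple linear model, not a real neural network; the dataset, learning rate, batch size, and step count are all made-up values for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": 10,000 observations of 5 features with a known linear
# relationship plus a little noise, standing in for real training data.
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=n)

w = np.zeros(d)           # weights to learn
lr, batch_size = 0.1, 64  # illustrative hyperparameter choices

for step in range(500):
    # Select a small random batch instead of touching all n observations.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]

    # Forward pass on just the batch, then the mean-squared-error gradient
    # for the backward step, then the weight update. Repeat with a new batch.
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    w -= lr * grad
```

After 500 of these cheap batch updates, `w` lands close to the true weights even though each step saw only 64 of the 10,000 observations.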
Faster, less computationally expensive, and more accurate predictions? It doesn’t get much better than that!
Weight Decay
One problem with the neural network approach is that, by default, there are no constraints placed on the magnitude of the weight and bias values, and you can run into a situation where, by some quirk of the math, a few features end up having gigantic impacts via their weights (which, again, can result in overfitting). To mitigate this, when you calculate the loss, you add a penalty based on the magnitude of the weights. Typically, you might square each weight value (which makes sure it’s positive) and then multiply the result by some value — often called a “lambda” value — which is a hyperparameter you have to set to determine how much larger weight values will contribute to the overall loss. Common lambda values might range between 0.1 and 0.001, but there are no hard and fast rules.
When the neural network minimizes the overall loss, then, it will be nudged towards smaller weight values, with the overall impact based on the lambda value you selected. In effect, you are rewarding the network for “spreading the wealth” in terms of weight values and preventing it from finding just a few features and over-relying on them. You are instructing the neural network to de-emphasize stronger features (and thus, emphasize weaker ones relatively speaking) in order to avoid overfitting.
Like most hyperparameters, the best lambda value can’t be known prior to testing and experimentation, so some form of grid search — trying multiple values until the best one is found — is usually recommended.
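Here is a minimal sketch of that penalty in NumPy, again on a toy linear model rather than a full neural network; the `lam` hyperparameter, the dataset, and the training settings are all illustrative choices, not a prescription.

```python
import numpy as np

def penalized_loss(w, X, y, lam):
    """Mean-squared-error loss plus the weight-decay penalty lam * sum(w**2).

    Squaring each weight keeps the penalty positive; lam controls how much
    large weights are charged against the overall loss.
    """
    mse = np.mean((X @ w - y) ** 2)
    return mse + lam * np.sum(w ** 2)

def penalized_grad(w, X, y, lam):
    # The penalty contributes 2 * lam * w to the gradient, so every update
    # nudges each weight a little toward zero (hence "weight decay").
    mse_grad = 2 * X.T @ (X @ w - y) / len(y)
    return mse_grad + 2 * lam * w

# Toy data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -3.0, 1.0]) + rng.normal(scale=0.1, size=200)

def fit(lam, steps=1000, lr=0.05):
    w = np.zeros(3)
    for _ in range(steps):
        w -= lr * penalized_grad(w, X, y, lam)
    return w

w_plain = fit(lam=0.0)  # no regularization
w_decay = fit(lam=0.1)  # penalized: the learned weights shrink toward zero
```

Comparing the two fits shows the effect directly: the `lam=0.1` solution has a smaller weight norm than the unregularized one, because the optimizer trades a little training loss for smaller weights.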
Dropout
This one is fun. When we discussed the Random Forest approach to decision trees, we talked about how, at each split point in a tree, a random subset of the overall features was used to decide the split rather than all of the features in the data set, and this randomization occurred at every split point. This stochastic approach to feature selection and splitting served to regularize the trees, giving more weight to otherwise weaker variables and de-emphasizing stronger ones overall. (See a pattern here yet?) Dropout, while unique to neural networks, is philosophically a very similar approach.
Basically, using dropout means you select a certain percentage of nodes in each hidden layer of your neural network (say, between 10% and 50% — another hyperparameter you need to experiment with), and you simply turn them off. For that forward pass, the weight and bias values at those nodes simply do not count in terms of calculating loss, and thus do not impact the weight updates in that iteration. At the next iteration, you turn off the same number of nodes in each layer, but the selection is random every time — just like Random Forest decision trees at each split point. And again, the end result is the same: the neural network often does not have the benefit of the features with strong signal, and so must figure out how to minimize the overall loss by placing more emphasis on weaker features.
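The masking step above can be sketched as a small NumPy function. This uses the common “inverted dropout” convention (scaling the surviving units up during training so nothing special is needed at inference time); the 30% drop rate and the layer of 1,000 units are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(activations, p_drop, training=True):
    """Randomly zero out a fraction p_drop of the units in a layer.

    Surviving units are scaled by 1 / (1 - p_drop) so the expected
    activation is unchanged; at inference (training=False) the layer
    passes through untouched.
    """
    if not training or p_drop == 0.0:
        return activations
    # A fresh random mask is drawn on every forward pass, just as a
    # Random Forest redraws its feature subset at every split point.
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

# On a layer of 1,000 unit activations with 30% dropout, roughly 300
# units are silenced on this particular pass.
h = dropout(np.ones(1000), p_drop=0.3)
```

Because the mask is redrawn each iteration, no single node (and so no single strong feature path) is reliably available, which is exactly the regularizing pressure described above.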
SGD, weight decay, and dropout are just a few of the approaches available for regularization in neural networks. They introduce “fuzziness” into your calculations by penalizing weight values that get too large or by hiding random data or skipping random calculations along the way, which makes it harder for your neural network to quickly find precise math to converge perfectly on the training data. The effect of this, however, is that it reduces the impact of noise or other irregularities in the data you are working with, which means the model typically performs better on test data and in real-world predictions.
Fuzzy Wuzzy Neural Nets
Better Testing Error Gets!
If Fuzzy Models Error Low
Fuzzy Networks Steal the Show.