Why Machine Learning Isn’t Perfect (Yet)
Potholes to Producing Perfect Predictions: Bias and Variance (and Noise)
I have previously touched on decision trees (theory and application) as well as bagging, random forest, and gradient boosted trees at a really high level. In this post, I’m going to go deeper into what makes decision trees awesome, what makes them terrible, and what we can do to make them awesome again. In future writing, I’ll specifically dive into the more sophisticated approaches in more detail. For now, we just need to understand the problem.
Decision trees are fantastic because, like deep learning, they allow you to fit a machine learning model to nearly any functional shape. We’ve looked at this example before, but linear models struggle when the function they’re trying to discover / predict isn’t linear in the first place.
In this example, the dots represent our training data labels and the line represents our predictions based on a linear model. No matter how well the line is fit to this data, you are going to suffer from high “bias”, i.e. unacceptably high errors even on the training data, because a straight line simply cannot bend to follow the curve. And if your model can’t get accurate on the training data, then you can assume its performance on testing and real-world data is going to be pretty terrible as well. The worst part is that it doesn’t really matter how much data you throw at the problem. More data points that follow the same pattern are not going to help a linear model make better predictions.
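To make the “more data doesn’t help” point concrete, here is a minimal sketch in plain Python. It assumes a sine curve as the non-linear target (my stand-in for the example above, not the original data), and the helper names `fit_line` and `training_mse` are made up for illustration.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def training_mse(n, seed=0):
    """Fit a line to n noise-free points of sin(x) and return the training MSE."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) for x in xs]  # assumed target: clearly non-linear, no noise
    a, b = fit_line(xs, ys)
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

small = training_mse(50)
large = training_mse(5000)
print(f"training MSE with 50 points:   {small:.3f}")
print(f"training MSE with 5000 points: {large:.3f}")
```

Even with a hundred times more points and no noise at all, the training error stays far from zero. That leftover error is the bias.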
Decision trees solve this problem completely (too well, actually, but we’ll get to that…). Because they don’t assume any particular functional form for the prediction, they can literally keep dividing the data into smaller and smaller subsets until your training error approaches zero (or hits exactly zero, as long as no two identical inputs carry different labels). So using the training data in the example above, you can keep building out the decision tree until your predictive function matches the training data perfectly.
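Here is a rough sketch of that “keep dividing” idea: a tiny one-feature regression tree in plain Python that splits until every leaf holds a single training point. The data (a noisy sine curve) and all the function names are my own assumptions for illustration, and the split rule (minimize squared error) is one common choice rather than the only one.

```python
import math
import random

def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(points):
    """Grow a 1-D regression tree until no further split is possible."""
    ys = [y for _, y in points]
    xs = sorted({x for x, _ in points})
    if len(xs) == 1:
        return sum(ys) / len(ys)        # leaf: predict the mean label
    best = None
    for lo, hi in zip(xs, xs[1:]):      # candidate thresholds between neighbours
        t = (lo + hi) / 2
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        cost = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    _, t, left, right = best
    return (t, build_tree(left), build_tree(right))

def predict(tree, x):
    """Walk down the tree until we hit a leaf value."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[1]) + count_leaves(tree[2])

rng = random.Random(1)
xs = [rng.uniform(0, 2 * math.pi) for _ in range(40)]
train = [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]  # noisy labels

tree = build_tree(train)
train_mse = sum((predict(tree, x) - y) ** 2 for x, y in train) / len(train)
print("leaves:", count_leaves(tree))
print("training MSE:", train_mse)
```

The tree ends up with one leaf per training point, so it reproduces every label exactly, noise included.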
The problem with decision trees, then, is high “variance”: really great accuracy on your training data, but really bad performance on testing data or in real-world situations. It turns out that the data sets machine learning models are built on don’t always (and, it could be argued, rarely) represent the totality of truth in the real world. What if, for example, in the real world your data looked like this?
In this situation, our really fancy decision tree model is going to be **worse** than the linear model in the real world because the data it was trained on don’t reflect reality, and the tree model has learned some funky patterns that are just anomalies of the training data.
The good news about variance is that you can actually help mitigate it by adding more data. The more data you have, the closer to reality your decision tree model can be. In fact, if you have access to infinite and perfect data, you can build decision tree models that make perfect predictions in the real world as well. So why aren’t we just going out and getting more data, building perfect models, and calling it a day?
Because it’s impossible. Perfect data doesn’t exist. And you probably couldn’t afford it if it did.
So we’re stuck building imperfect models. Should we just abandon decision trees, then, since they suffer from high variance? Not at all! You can actually mitigate variance in decision trees by introducing a little bias. In other words, don’t let them grow until they perfectly match the training data. Instead, stop them based on some criterion such as maximum depth, number of nodes, or minimum number of data points in a node. This lowers your training accuracy, but since you’re not fitting so closely to the training data, you’re usually improving the accuracy on test and other data. Giving up a bit of one kind of error to reduce the other is referred to as the bias-variance trade-off.
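As a sketch of what stopping early buys you, the snippet below grows the same kind of toy one-feature regression tree twice on noisy data: once with no limit and once capped at a maximum depth of 3. The data set (a sine curve plus Gaussian noise), the depth of 3, and every function name here are assumptions chosen for illustration.

```python
import math
import random

def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(points, max_depth=None, depth=0):
    """1-D regression tree; max_depth=None means grow until nothing is left to split."""
    ys = [y for _, y in points]
    xs = sorted({x for x, _ in points})
    if len(xs) == 1 or depth == max_depth:
        return sum(ys) / len(ys)        # leaf: predict the mean label
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [p for p in points if p[0] <= t]
        right = [p for p in points if p[0] > t]
        cost = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or cost < best[0]:
            best = (cost, t, left, right)
    _, t, left, right = best
    return (t,
            build_tree(left, max_depth, depth + 1),
            build_tree(right, max_depth, depth + 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

def mse(tree, data):
    return sum((predict(tree, x) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng, noise=0.5):
    """Assumed 'real world': sin(x) plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 2 * math.pi)
        data.append((x, math.sin(x) + rng.gauss(0, noise)))
    return data

rng = random.Random(7)
train, test = make_data(80, rng), make_data(1000, rng)

full = build_tree(train)                  # memorizes the training set
shallow = build_tree(train, max_depth=3)  # at most 8 leaves

full_train, full_test = mse(full, train), mse(full, test)
shallow_train, shallow_test = mse(shallow, train), mse(shallow, test)
print(f"full tree:    train {full_train:.3f}   test {full_test:.3f}")
print(f"depth-3 tree: train {shallow_train:.3f}   test {shallow_test:.3f}")
```

The capped tree does worse on the training data but better on the held-out data, which is the trade in action. The right depth depends on your data, so in practice you would pick it by checking a validation set rather than hard-coding 3.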
It turns out there are more sophisticated methods to reduce variance in decision trees. We can implement bootstrap aggregation (bagging), which resamples a limited data set with replacement to emulate drawing many data sets from the “infinite” real world, trains a tree on each resample, and averages their predictions. We can add to this a technique called random forest, which further fuzzes up the training fit by letting each split consider only a random subset of the features, so the individual trees end up less correlated with one another. If bias is really the bigger issue, we can use a gradient descent approach (boosting) to slowly shave away at the error: train an intentionally high-bias tree (something we’ve built to a fairly shallow depth, perhaps), measure how far off its predictions are, and then train the next tree using those residual errors as the label, adding each new tree’s correction to the running prediction.
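Boosting is the easiest of these to sketch in a few lines. The version below is a bare-bones least-squares boosting loop, assuming depth-1 trees (“stumps”) as the high-bias learner, a noisy sine curve as the data, and a learning rate of 0.3. All of those choices, and the function names, are mine for illustration; real libraries add many refinements on top of this.

```python
import math
import random

def fit_stump(xs, targets):
    """Depth-1 regression tree: the single split that minimizes squared error."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Repeatedly fit a stump to the current residuals and add a damped correction."""
    preds = [0.0] * len(xs)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # how far off we are right now
        t, lm, rm = fit_stump(xs, residuals)            # residuals are the new labels
        lm, rm = learning_rate * lm, learning_rate * rm
        stumps.append((t, lm, rm))
        preds = [p + (lm if x <= t else rm) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return stumps, history

def predict(stumps, x):
    return sum(lm if x <= t else rm for t, lm, rm in stumps)

def mse(stumps, data):
    return sum((predict(stumps, x) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng, noise=0.2):
    """Assumed 'real world': sin(x) plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 2 * math.pi)
        data.append((x, math.sin(x) + rng.gauss(0, noise)))
    return data

rng = random.Random(3)
train, test = make_data(100, rng), make_data(500, rng)
stumps, history = boost([x for x, _ in train], [y for _, y in train])

test_first, test_all = mse(stumps[:1], test), mse(stumps, test)
print(f"train MSE after round 1: {history[0]:.3f}, after round 50: {history[-1]:.3f}")
print(f"test MSE with 1 stump:   {test_first:.3f}, with 50 stumps:  {test_all:.3f}")
```

Each stump on its own is a terrible model, but because every round targets whatever error is left over, the sum of them tracks the curve much more closely. Run it for too many rounds and it will eventually start chasing noise, so the number of rounds is one more thing to tune.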
More to come!