Fake It Till You Make It
Decision Tree Models Using Bagging and Random Forest
In my previous post I discussed the problem of variance: fitting a machine learning model so closely to its training data that it picks up noise and quirks that don’t reflect reality. This “overfitting” produces great results when making predictions on that same training data, but the model performs poorly on testing data (or “holdout” data, meaning data not included when training the model) and when making real-world predictions. As we discussed, this can produce really complex, fancy models that are worthless in the real world.
As we mentioned, one of the ways to reduce variance is to simply add more data. The more data you have, the more likely it is to reflect reality, and as such, the more accurate your decision tree is going to be. Unfortunately, the logistics behind “simply add more data” are tremendously challenging. A lot of the big advancements in AI at Google and other cloud providers center on building systems that can handle ever larger amounts of data. And yes, when your data reach a certain size, they can in some instances reach amazing levels of predictive accuracy. But not always, and running these systems is complex and at scale can be prohibitively expensive (albeit 100x cheaper on a cloud than trying to build your own, but still, a million dollars is a million dollars…). If you don’t already know whether the data are going to get you where you need to go, jumping straight to MOAR DATA and PHENOMENAL COSMIC COMPUTE POWER…
…is a path fraught with heartache and waste, presuming you even have the means to acquire the massive data you seek in the first place.
Pull Data Up By Its Bootstraps
One method that was developed to address this problem is Bootstrap Aggregation, also known as Bagging. It’s a deceptively simple approach to taking the data you have and faking as though it were “infinite”, or at least a lot bigger than it actually is. It starts with building multiple data sets out of your single data set using Bootstrapping, which means sampling with replacement. You build as many replica data sets as you want (whatever approximates “infinity” or “as much as my poor little laptop can possibly handle” for you) by randomly picking a data point from your data set and copying it into a replica, repeating until the number of data points in the replica equals the number of points in the original. Since the selected data point is never removed from the original, lots of data points are going to show up in a replica more than once, and some won’t show up at all.
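The replica-building step described above can be sketched in a few lines of standard-library Python. This is only an illustration: the function names `bootstrap_replica` and `make_replicas` and the toy data set are made up for this example, not taken from any particular library.

```python
import random


def bootstrap_replica(data, rng):
    """Return a replica the same size as `data`, drawn with replacement."""
    # choices() samples WITH replacement, so the same point can be
    # picked several times and some points may never be picked.
    return rng.choices(data, k=len(data))


def make_replicas(data, n_replicas, seed=0):
    """Build `n_replicas` bootstrap replicas from one original data set."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [bootstrap_replica(data, rng) for _ in range(n_replicas)]


# Toy "data set": ten points labelled 0..9 (hypothetical).
original = list(range(10))
replicas = make_replicas(original, n_replicas=3)

for replica in replicas:
    # Sorting makes the repeats and the missing points easy to spot.
    print(sorted(replica))
```

Each printed replica has ten entries, but typically only six or seven distinct values.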
On large data, it turns out you can expect each replica built using Bootstrapping with replacement to contain about 63% of the distinct original data points (the fraction approaches 1 − 1/e ≈ 0.632 as the data set grows), with the remaining slots filled by duplicates. If you build out a full-depth decision tree using this replica data set, odds are it is going to be **even worse** than the decision tree built using the original data. So what’s the point?
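Where does the 63% come from? A given point is missed by a single draw with probability 1 − 1/n, so it is missed by all n draws with probability (1 − 1/n)^n, which approaches 1/e as n grows. The sketch below checks that arithmetic and compares it with one simulated replica; the function names are illustrative only.

```python
import math
import random


def expected_unique_fraction(n):
    """Expected share of distinct original points in a size-n replica."""
    # Probability that one specific point is never drawn in n tries.
    p_never_picked = (1.0 - 1.0 / n) ** n
    return 1.0 - p_never_picked


def simulated_unique_fraction(n, seed=0):
    """Draw one replica of point indices and measure its distinct share."""
    rng = random.Random(seed)
    replica = [rng.randrange(n) for _ in range(n)]
    return len(set(replica)) / n


n = 10_000
print("formula   :", expected_unique_fraction(n))
print("simulation:", simulated_unique_fraction(n))
print("1 - 1/e   :", 1.0 - 1.0 / math.e)
```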
Oh Yeah, That’s My Bag Baby!
The decision tree built on a replica is what I’ll loosely call a “weak learner”: it’s not great, and we know it’s not great. But it turns out that if you build a really large number of these randomly generated trees and average their results (an example of an approach to machine learning called “ensembling”, which combines multiple machine learning models to generate a prediction), the errors that are specific to each individual tree tend to cancel out. The ensemble smooths over some of the noise and other idiosyncrasies in the original data and often produces predictions that perform significantly better in the real world.
This is the “aggregation” part of Bootstrap Aggregation. Take a single data set, create 500 (or again, however many makes sense) randomly selected cruddy replicas, build out full-depth decision trees on all of them, and then whatever the average prediction is across all 500 models (or, for classification, whichever class gets the most votes) is what gets returned as the prediction on novel data.
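Here is a minimal sketch of the whole bag-then-aggregate loop. To keep it short and dependency-free, it swaps in a 1-nearest-neighbor predictor as a stand-in for the full-depth decision tree (both memorize their training data and are high-variance); the toy data, the function names, and the choice of 500 models are all assumptions made for illustration.

```python
import random
from collections import Counter
from statistics import mean


def fit_one_nn(points):
    """Stand-in base learner: 1-nearest-neighbor over (x, y) pairs."""
    def predict(x):
        nearest = min(points, key=lambda p: abs(p[0] - x))
        return nearest[1]
    return predict


def fit_bagged(points, n_models, seed=0):
    """Fit one base learner per bootstrap replica."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: same size as the original, drawn with replacement.
        replica = rng.choices(points, k=len(points))
        models.append(fit_one_nn(replica))
    return models


def predict_average(models, x):
    """Regression-style aggregation: average the individual predictions."""
    return mean(m(x) for m in models)


def predict_majority(models, x):
    """Classification-style aggregation: the most common label wins."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: y = 2x plus Gaussian noise.
data_rng = random.Random(42)
train = [(i / 10, 2 * (i / 10) + data_rng.gauss(0, 0.5)) for i in range(100)]

models = fit_bagged(train, n_models=500)
print("single model :", fit_one_nn(train)(5.03))
print("bagged (avg) :", predict_average(models, 5.03))
```

The noise-free answer at x = 5.03 is 10.06. The single model returns whatever noisy value its nearest training point happens to carry, while the bagged prediction blends several nearby points.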
This has a similar effect on training accuracy as things like limiting the maximum depth of a decision tree: it tends to increase bias (training error) a little. But extra training error only really matters when your model is already biased from the start, and in the case of full-depth decision trees, this is rarely a significant issue. The improvement in testing accuracy and real-world prediction makes any loss in training accuracy well worth it.
The Lingering Problem With Bagging
Bagging is great, but it suffers from one overarching problem: strong correlation between the decision trees. Every tree gets to look at every variable, so if you have noisy or idiosyncratic variables in your data that are overpowering in terms of training data prediction, most of the trees will split on them in much the same way, and averaging trees that make the same mistakes doesn’t cancel those mistakes out. The way to deal with this is a form of regularization: intentionally introducing randomness into which variables are considered when building the decision tree models. The approach that does this is called Random Forest.
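To see why correlation matters, it helps to put numbers on it. A standard result says that the average of B predictions, each with variance σ² and with pairwise correlation ρ between them, has variance ρσ² + (1 − ρ)σ²/B. The sketch below just evaluates that formula; the function name and the example values are illustrative.

```python
def variance_of_average(sigma2, rho, n_models):
    """Variance of the mean of n_models equally correlated predictions."""
    # First term: the part shared by every model, which averaging cannot remove.
    # Second term: the independent part, which shrinks as models are added.
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models


for rho in (0.0, 0.5, 0.9):
    print(rho, variance_of_average(sigma2=1.0, rho=rho, n_models=500))
```

With uncorrelated trees the variance keeps shrinking as trees are added, but with ρ = 0.9 it can never drop below 0.9σ² no matter how many you build. Lowering ρ is what the randomness is for.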
Random Forest works exactly like Bagging (in fact, it **is** Bagging), but when building each tree, it makes the decision at each split point using only a random subset of the features / variables, drawn fresh each time. The size of that subset is a tunable parameter; a common starting point for classification is the square root of the total number of variables in the data set. So if you had a data set with 100 variables, each time you evaluated a split point, you would randomly select 10 of them and only use those 10 to decide how to split the data.
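The per-split feature selection can be sketched as follows. It covers only the "which variables may this split look at" step, not a full tree builder, and the name `features_for_split` and its `max_features` parameter are made up for this example.

```python
import math
import random


def features_for_split(n_features, rng, max_features=None):
    """Pick the random subset of feature indices one split may consider."""
    if max_features is None:
        # Square-root rule of thumb: 100 features -> 10 candidates.
        max_features = max(1, math.isqrt(n_features))
    # sample() draws WITHOUT replacement, so no feature appears twice.
    return rng.sample(range(n_features), max_features)


rng = random.Random(0)
for split_number in range(3):
    # Each split gets its own fresh draw of candidate features.
    print(split_number, sorted(features_for_split(100, rng)))
```

A dominant variable only makes it into roughly one in ten of these draws, so it can no longer dictate the top split of every tree.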
This stochastic approach to building out decision trees on bootstrapped data sets means that those overpowering variables have less influence, and other variables that might otherwise be almost ignored get a more enhanced “voice” in the overall aggregate decisions (not entirely unlike the effect ridge regularization has on linear regression). Adding this “fuzz” to the decision making process can further reduce variance in bagged decision tree predictive models, which usually means they get even more accurate on real-world and test data.
When you’re trying to build out a decision tree model and variance is an issue, and you simply can’t add more data for whatever reason (or want to test and see if adding data might help…), the Random Forest approach can be an effective way to make the best of the data you have available and improve your real-world model performance.