Create the Problem, Sell the Cure
Intentionally Creating and Then Remediating Bias using Boosting
One old con goes something like this: approach a healthy person and convince them they have a malady based on symptoms that universally apply to nearly every human being. For example: “Do you sweat when you run? Does it get worse if you run faster? Oh no…” Then, produce a bottle of “medicine” that will supposedly treat, if not outright cure, said malady. If the target is convinced that they are in danger should they not address the issue, they’ll likely pay for whatever is in the bottle (which is probably just water, if you’re lucky…) and the salesperson moves on to the next victim. The stuff in the bottle, the supposed cure to the disease the salesperson inflicted on you, is usually referred to as “snake oil.”
What does this have to do with machine learning, you ask? Well, step right up and let me explain it to you!
Not much. But intentionally inflicting a problem and then providing a cure does share a surface-level similarity with Gradient Boosted Regression Trees (GBRT for short).
As we’ve discussed a couple of times now, the problem with decision trees is overfitting and variance. One of the ways to reduce overfitting is to limit the maximum depth to which a tree can build. This automatically results in an increase in bias — or, an intentionally inflicted increase in the error you get with your training data. As long as this improves testing error and real-world predictions, we typically don’t care, since that’s where the value of the predictive model comes from.
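To see the trade in action, here’s a quick sketch, assuming made-up data (not anything from this post), a scikit-learn DecisionTreeRegressor, and an arbitrary depth cap of 3. Capping the depth will push the training error up; the hope is that the testing error comes down with it:

```python
# A sketch of the depth-cap trade-off on synthetic data; the dataset, noise
# level, and max_depth value are arbitrary choices for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=500)   # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 3):   # None = grow until the leaves are pure, 3 = intentionally shallow
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train MSE={mean_squared_error(y_train, tree.predict(X_train)):.3f}, "
          f"test MSE={mean_squared_error(y_test, tree.predict(X_test)):.3f}")
```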
However, sometimes (often?) you’ll run into a situation where the increase in bias doesn’t do enough to improve variance, and what you end up with is a model that has both high variance (testing error) **and** high bias (training error). The blunt-object approach of simply limiting the depth to which a tree can build will only take you so far. You might go back to the drawing board and try to reduce variance using Random Forest, but there’s another way to potentially address the problem, and maybe get even better results: Boosting.
Boosting is a concept in which you train a machine learning model with high bias, which for decision trees means once again intentionally building a shallow tree to avoid overfitting. But unlike Bagging, in which we create replica data sets and build lots of models on similar data, with Boosting we adjust the labels themselves, nudging each one toward its average prediction (its leaf mean) based on the direction and magnitude of the training error. We multiply this error by some learning rate and then subtract it from our labels to create “residuals.” We then use the exact same data and build another decision tree, but this time with these residuals as our label values. This has the effect of reducing bias (more on that in a minute), and since we’re not overfitting, it usually reduces testing error as well! We repeat this process until our residuals approach zero or until our bias stops improving in a meaningful way between iterations.
To be honest, this concept was really hard for me to wrap my head around, but I came up with an example that will hopefully illustrate how it works. Let’s build a very simple data set with only 7 observations:
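(The actual values live in the images that accompany this example; if you want to follow along in code, here’s a hypothetical stand-in with the same shape: one feature, seven labels. The specific numbers are mine and only exist so the snippets below run.)

```python
# Hypothetical stand-in for the 7-observation data set used in this example.
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])   # single feature
y = np.array([1.0, 1.5, 2.0, 6.0, 7.0, 7.5, 8.0])                 # labels
```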
Now let’s build a decision tree and only allow it to split once. (Note: on much larger data, you’d probably split more than once, but maybe only three or four times.) The predicted value for each observation will be the average label value of the leaf node in which it resides. So for our first iteration, if we assume this was our distribution:
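In code, one way to reproduce the “split once” step is a depth-1 DecisionTreeRegressor, often called a stump. Continuing with the stand-in X and y from the snippet above:

```python
# A depth-1 tree ("stump") splits the data exactly once; each prediction is
# the mean label of whichever leaf the observation lands in.
from sklearn.tree import DecisionTreeRegressor

stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X, y)
preds = stump.predict(X)   # only two distinct values, one per leaf
print(preds)
```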
Then we end up with training errors that look like this:
We multiply this error by a small learning rate, say 10% or 1% or whatever makes sense given the nature of the data, and then create our first iteration of residual labels (denoted by “t” in the image below) by subtracting this weighted error from our labels:
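Continuing the code version, I’m defining the error as label minus prediction so that subtracting the scaled error nudges each label toward its leaf mean, which is what the description above calls for. The 10% learning rate is just the example value from the text:

```python
# Per-observation training error, then the first set of residual labels.
learning_rate = 0.1                      # "say 10%"
error = y - preds                        # label minus predicted leaf mean
residuals = y - learning_rate * error    # labels shifted toward the predictions
print(error)
print(residuals)
```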
We then iterate through this again. This time we build the decision tree using the residuals as the labels, and as a result, one of our data points gets moved to the other leaf node, affecting mean calculations on both sides:
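In code, the second round is simply another stump fit on the residual labels. With the author’s data (pictured) an observation switches leaves at this point; with my stand-in numbers the chosen split may stay put, in which case, per the note just below, the predictions won’t change:

```python
# Second depth-1 tree, this time trained on the residual labels.
stump2 = DecisionTreeRegressor(max_depth=1, random_state=0)
stump2.fit(X, residuals)
preds2 = stump2.predict(X)
print(preds2)
```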
Note: If no observations moved to a different node, the predicted values would not change.
This results in a new set of error values vs. the residual labels, which look something like this:
When you compare the squared loss of this new set of errors with the previous set, you get a marked reduction in aggregate training error (squared loss):
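Here’s that comparison on the stand-in numbers (your exact values will differ from the image, but the second round’s loss should come out lower, because the residual labels sit closer to their leaf means):

```python
# Aggregate squared loss: round 1 (predictions vs. labels) vs.
# round 2 (predictions vs. residual labels).
loss_round1 = np.mean((y - preds) ** 2)
loss_round2 = np.mean((residuals - preds2) ** 2)
print(f"round 1 squared loss: {loss_round1:.4f}")
print(f"round 2 squared loss: {loss_round2:.4f}")
```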
The skeptics in my audience might point out that the second set of training errors was measured against the previous residuals and not the actual training labels. Not to worry, though: the math holds up regardless of how you look at it. Comparing the loss of the new predicted values against the original training labels with the loss from our first iteration still shows an improvement:
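If you want to check that end-to-end, here’s the whole procedure wrapped in a small self-contained loop that prints each round’s squared loss against the original labels. It’s a sketch of the label-nudging scheme described in this post, not of any particular library’s GBRT implementation; the function name, defaults, and demo values are my own, and whether the loss actually drops depends on the tree picking a new split as the labels shift, which is exactly what happens in the example pictured above:

```python
# Self-contained sketch: nudge labels toward predictions, refit a shallow
# tree, repeat, and report the loss against the ORIGINAL labels each round.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_by_label_nudging(X, y, learning_rate=0.1, rounds=5, max_depth=1):
    labels = y.astype(float).copy()            # these get nudged every round
    for t in range(rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, labels)
        preds = tree.predict(X)
        print(f"round {t}: squared loss vs. original labels = "
              f"{np.mean((y - preds) ** 2):.4f}")
        labels = labels - learning_rate * (labels - preds)   # move toward predictions
    return tree

# Demo on the hypothetical stand-in data from earlier:
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y = np.array([1.0, 1.5, 2.0, 6.0, 7.0, 7.5, 8.0])
boost_by_label_nudging(X, y)
```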
The really cool part about GBRT is that you are intentionally introducing bias into a decision tree, and then mitigating it over successive iterations (gradient boosting, which is a similar concept to the gradient descent approach we examined with logistic regression). Unlike snake oil, which at best did nothing, with GBRT you are improving both training and testing error at the same time without adding any new data or emulating infinite data using replicas.
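For anyone wondering where the “gradient” in the name comes from: in the standard textbook formulation (slightly different bookkeeping from the label-nudging walkthrough above), the residual for squared-error loss is exactly the negative gradient of the loss with respect to the current prediction, so fitting each new tree to the residuals is one gradient descent step, just taken in the space of predictions instead of the space of weights:

```latex
% Squared-error loss for one observation, with F(x_i) the current prediction.
L\bigl(y_i, F(x_i)\bigr) = \tfrac{1}{2}\bigl(y_i - F(x_i)\bigr)^2,
\qquad
-\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)
```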
Whether GBRT or Random Forest does a better job depends on the underlying data itself and, in some situations, on random chance. I have personally run performance comparisons between the two methods where, depending on the seed I set when initializing each approach, the final test accuracy shifted by a few percentage points from run to run, and the two models landed close enough to each other that the seed selection actually determined which modeling approach performed better.
Today many data scientists reach for GBRT by default rather than Random Forest, but that won’t always produce the best results. Both ensemble approaches are valid ways to reduce variance in decision tree machine learning models.