# L2 Regularization: What’s Actually Going On?

I find it interesting that there are a *lot* of folks in data science who work with and build complex predictive models but have somehow missed the basics of how those modeling approaches work. I get it — you learn something well enough to meet a specific objective and move on — but it really can be a detriment to your understanding of what machine learning is and what it does.

One of those core concepts that often gets misunderstood is L2 regularization, so I’m going to dissect it in simple terms here.

Also, I’m going to avoid any jokes pertaining to “regularity” in any other sense of the word. You’re welcome.

# Why Consider L2 Regularization?

L2 regularization begins with the assumption that our data contain extremes that are not reflective of reality. It is a mathematical tool for penalizing a model that makes extreme judgments about a feature. The goal is to compensate for idiosyncrasies in the data that can lead to models making suboptimal decisions in the real world.

Every feature (or variable) in a dataset is subject to potential relative sampling error. By “relative” I mean a feature risks appearing either generally more impactful or generally less impactful in the data than it actually is in the real world. This presents problems when building parametric machine learning models, because they’re going to come to an optimal solution based on the data they have available, and it’s not until you try making predictions using that solution that you discover it doesn’t line up with reality as well as you’d hoped.

L2 regularization results in a model with an overall increase in the loss value on the **training data** used to build the model. If our data are actually reflective of the real world, this would be a bad thing. However, if the assumption about extremes is correct, the adjusted model will perform better on real-world, unseen data, which is the ultimate goal of all machine learning.

# L2 Regularization Definition and Considerations

L2 regularization adds a penalty term to your model’s loss value: square each feature weight the model computes, sum them all up, and multiply the result by a tunable hyperparameter (usually referred to as either “alpha” or “lambda,” depending on the author). I will refer to this value as simply a constant.

L2 Loss = ((Sum of Squared Weights) * A Constant) + Model Loss

For example, say you have a model that has an overall loss of 32 without regularization. This model has three features, and the weights for each feature are as follows: w0 = +4, w1 = +0.1, and w2 = -1. For now, we’ll assume our hyperparameter constant is set to 1. Thus, the loss of our model under L2 regularization equals:

((Sum of Squared Weights) * A Constant) + Model Loss

((16 + 0.01 + 1) * 1) + 32

…or 49.01, which we can call 49. (The effect of the 0.01 on the total is small enough that we can ignore it for practical purposes.)
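The arithmetic above is easy to check in code. Here’s a minimal sketch in Python (the `l2_loss` name is mine, not from any library):

```python
def l2_loss(weights, model_loss, constant):
    """(sum of squared weights * constant) + model loss."""
    penalty = sum(w ** 2 for w in weights) * constant
    return penalty + model_loss

# Weights from the example: w0 = +4, w1 = +0.1, w2 = -1
total = l2_loss([4, 0.1, -1], model_loss=32, constant=1)
print(round(total, 2))  # 49.01
```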

Note that this higher loss value is, on its own, just a number for comparison. We have not changed the model weights at all, so the L2-regularized value of 49 describes the exact same predictive model as the non-regularized value of 32.

We’ll look at some additional examples in the next major section and how this modifies the model in the end. For now, as long as you understand how we get to the value of 49 here, you’re good to go.

## Parametric Models Only

Because L2 regularization applies a penalty based on feature weights, only modeling approaches that make decisions by weighting features are candidates for it. Thus, non-parametric modeling approaches such as nearest neighbors or decision tree-based models are not candidates for L2 regularization. Non-parametric models have other ways of accomplishing similar things (for example, increasing k for kNN or lowering the max depth for decision trees), but technically speaking, they do not employ L2 regularization in the mathematical sense.

## How Much Do You Trust Your Data?

The overall impact of L2 regularization is tuned via the constant hyperparameter. If you believe your data are generally pretty representative of reality and you just want to nudge the model weights a bit, you can set this hyperparameter to something small (say, 0.1 or 0.01) and the effect of L2 regularization will be present but minimal. On the other hand, if you believe your data are likely not representative of reality (perhaps you simply don’t have enough of them, or maybe the data collection practices leave something to be desired), then you can set the hyperparameter to something large (maybe a value between 2 and 10, or larger, depending on your data) so that it has a substantial impact on the overall model weights.

For example, in our first calculation where we reached a loss value of 49 when the constant value was 1, setting the constant to 0.1 would have resulted in:

((16 + 0.01 + 1) * 0.1) + 32

Again, we can ignore the 0.01, so we end up with 17 * 0.1 + 32, or 33.7, as our loss. The L2 regularization didn’t have much effect on the loss value at all.

On the other hand, if we had used a constant value of 10, the loss value would have been 170 + 32, or 202. In this scenario, the L2 penalty has a bigger impact on the loss value than the model loss itself.
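Sweeping the constant over the same toy model makes the scaling obvious. A quick sketch (note the code keeps the 0.01 term we’ve been rounding away):

```python
weights = [4, 0.1, -1]
model_loss = 32
penalty = sum(w ** 2 for w in weights)  # 17.01

# The penalty term scales linearly with the constant
for constant in (0.1, 1, 10):
    print(constant, round(penalty * constant + model_loss, 2))
# 0.1 33.7
# 1 49.01
# 10 202.1
```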

Again, we’ll look at the ramifications here in a bit. For now, just make sure you understand how adjusting the constant value changes the overall loss value calculations.

As a reminder, in the examples we’ve used so far, the model weights themselves have not changed. All we’ve done is look at different ways of computing a loss value for the exact same model.

## The End Result

In order to get to the optimal solution, L2 regularization will tend to reduce the magnitude of large absolute weight values and, conversely, may increase the magnitude of smaller absolute weight values. Put another way, all else being equal, it will tend to make the weights more similar to each other (magnitude-wise) than they would otherwise be without L2 regularization. It’s a smoothing function.
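One way to see why it smooths: the gradient of each weight’s squared penalty is proportional to the weight itself, so larger weights are pulled toward zero proportionally harder. A quick illustration using the toy weights from earlier:

```python
lam = 1.0  # the regularization constant
weights = [4.0, 0.1, -1.0]

# The derivative of lam * w**2 with respect to w is 2 * lam * w,
# so the penalty's pull toward zero grows linearly with the weight:
# w0 = 4.0 is pushed forty times harder than w1 = 0.1.
penalty_grads = [2 * lam * w for w in weights]
print(penalty_grads)  # [8.0, 0.2, -2.0]
```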

# Examples

Machine learning algorithms use closed-form matrix solutions, gradient descent, or combinations and derivatives of these to determine the set of weights that produces the lowest possible loss value. Let’s examine three candidate weight sets for our model that might have been intermediate points in the search for the optimal solution.

Without L2 regularization, the model simply finds the values that produce the lowest model loss, so in this case Model 1 (our original weights of +4, +0.1, and -1, with a model loss of 32) would be deemed the “optimal” model based on our training data.

Let’s say we suspect we might benefit from L2 regularization, so we decide to apply it with a constant (lambda/alpha) value of 1. When we recompute the loss values, notice that it is now Model 3 that gives us the lowest loss value, not Model 1 anymore. Thus, the regularized best model has changed based on our decisions and approach.

Note that compared to Model 1, Model 3 reduces w0 by 25% (from +4 to +3), but increases w1 tenfold (from +0.1 to +1) and doubles the magnitude of w2 (from -1 to -2). The absolute value sum of the weights is actually larger for Model 3 (6) than it is for Model 1 (about 5). (This will come back if we look at L1 regularization in a future post…)

We made much larger percentage moves in w1 and w2, but because they started at smaller magnitudes, Model 3’s penalty term under L2 regularization is lower (14 vs. roughly 17), which more than makes up for the difference in the original model loss (34 vs. 32).

## Changing the Constant

What does this look like if we decide to give more weight to the regularization effect? Let’s change the constant to 2 and recompute the loss values.

Note that now Model 2 would be deemed to be the best model of the three.

If we were to make the L2 regularization constant a small value, say 0.1 or 0.2, then the effect wouldn’t be big enough to change the outcome, and Model 1 would be picked as the optimal model.
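To make the comparison concrete, here’s a sketch of the three candidates. Model 1 and Model 3 use the weights and losses stated above; Model 2’s weights and loss are hypothetical values I’ve chosen so the outcomes match the narrative (Model 1 wins with a small constant, Model 3 at 1, Model 2 at 2):

```python
# (weights, model loss on the training data)
models = {
    "Model 1": ([4, 0.1, -1], 32),    # from the running example
    "Model 2": ([2.5, 1.5, -1], 39),  # hypothetical stand-in values
    "Model 3": ([3, 1, -2], 34),      # derived from the stated percentage changes
}

for constant in (0.1, 1, 2):
    totals = {
        name: sum(w ** 2 for w in ws) * constant + loss
        for name, (ws, loss) in models.items()
    }
    winner = min(totals, key=totals.get)
    print(f"constant={constant}: best is {winner}")
# constant=0.1: best is Model 1
# constant=1: best is Model 3
# constant=2: best is Model 2
```

Same weights, same model losses; only the constant changes which candidate comes out on top.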

Once again, let me reiterate — **The increasing or decreasing loss values due to regularization do not change the model weights themselves.** All that we are doing is determining which model is best based on the combination of the model loss plus the L2 regularization effect.

How do we know the best value for our constant? You’d need to test it just like anything else, perhaps employing grid search or some variation to find the value that gives you the best possible validation loss. Training loss is only relevant as the model loss input to the overall loss calculation when picking the best model to validate.
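As a sketch of what that search might look like, here’s a minimal pure-NumPy version: fit a closed-form ridge regression for each candidate constant and keep the one with the lowest validation loss. The data here is synthetic, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 features, noisy linear target
X = rng.normal(size=(200, 3))
y = X @ np.array([4.0, 0.1, -1.0]) + rng.normal(scale=2.0, size=200)

# Hold out a validation split
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

def ridge_fit(X, y, constant):
    """Closed-form L2-regularized least squares:
    w = (X'X + constant * I)^-1 X'y"""
    return np.linalg.solve(X.T @ X + constant * np.eye(X.shape[1]), X.T @ y)

def val_loss(constant):
    w = ridge_fit(X_train, y_train, constant)
    return np.mean((X_val @ w - y_val) ** 2)

# Grid search over candidate constants, judged on *validation* loss
grid = (0.01, 0.1, 1, 2, 10)
best_constant = min(grid, key=val_loss)
print(best_constant)
```

Larger constants shrink the fitted weights toward zero; the grid search simply picks whichever amount of shrinkage generalizes best to the held-out data.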

# Summary

Hopefully this blog post helps make clear the purpose of L2 regularization, how it works, and the impact it has based on the constant (alpha/lambda) hyperparameter value applied.