# L1 and Elastic Net Regularization

In the previous blog post I provided a relatively simple overview of L2 regularization, how it works, and why you would consider using it. In this one, we’ll look at another common approach — L1 regularization — and discuss what it does and how it’s very different from L2. We’ll then take a look at how the combination of L1 and L2 regularization — known as “Elastic Net” — works and how to think about what it does to your modeling approach.

# L1 Definition

L1 regularization looks very similar to L2, but it does something dramatically different. Whereas the L2 regularization penalty was the sum of the squared values of the model weights times a constant, the L1 regularization penalty is the sum of the absolute values of the model weights times a constant.

L1 Loss = ((Sum of Absolute Weights) * A Constant) + Model Loss

The goal of L1 regularization is to minimize the total magnitude of the weights (thus minimizing the L1 penalty) and, where possible, to push as many of them as it can all the way to 0 or close to it while still keeping model loss low. Whereas L2 regularization sought to smooth out values between weights, and as such left relatively more weight on features that would otherwise have smaller weights, L1 regularization tends to concentrate weight on as few features as possible, allowing us to ignore certain features altogether.

Similar to the previous blog example, say you have a model that has an overall loss of 56 without regularization. This model has three features, and the weights for each feature are as follows: w0 = 10, w1 = -4, and w2 = 1.5.

What does this look like with L1 regularization? For now, we’ll assume our hyperparameter constant value is set to 1. Thus, the imputed loss of our model under L1 regularization equals:

((Sum of Absolute Weights) * A Constant) + Model Loss

((10 + 4 + 1.5) * 1) + 56

…or 71.5. Again, the higher loss value is somewhat arbitrary: this is the same model that had a loss value of 56 without regularization. Nothing about the model itself has changed just because we added the regularization term.
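The arithmetic above can be sketched in a few lines of Python. The function name `l1_regularized_loss` and its parameter names are my own labels for illustration, not part of any library:

```python
def l1_regularized_loss(weights, model_loss, constant=1.0):
    """(Sum of Absolute Weights * Constant) + Model Loss."""
    l1_penalty = constant * sum(abs(w) for w in weights)
    return l1_penalty + model_loss


# The three-feature model from the example: w0 = 10, w1 = -4, w2 = 1.5
weights = [10, -4, 1.5]
print(l1_regularized_loss(weights, model_loss=56, constant=1))  # 71.5
```

Note that the sign of w1 does not matter here: only its magnitude (4) contributes to the penalty.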

However, here’s where things get interesting. With L1 regularization, we have a simple question to ask: if I decrease the magnitude of a feature weight by x, is the additional Model Loss greater than or smaller than the penalty I save, which is x times the constant (just x when the constant is 1)? If the Model Loss increase is smaller than the penalty saved, then the algorithm will keep making the model weight smaller until this is no longer true.
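As a rough sketch of that question, the check below compares the penalty saved against the extra model loss. The function `shrink_is_worthwhile` and the numbers passed to it are hypothetical, and a real optimizer makes this trade-off continuously through gradients rather than with an explicit yes/no test:

```python
def shrink_is_worthwhile(weight_reduction, model_loss_increase, constant=1.0):
    """True if shrinking a weight lowers the L1 regularized loss overall."""
    penalty_saved = constant * weight_reduction
    return model_loss_increase < penalty_saved


# Hypothetical: dropping a weight of 1.5 to 0 costs 1.0 in model loss.
print(shrink_is_worthwhile(1.5, 1.0))  # True: 1.0 < 1.5, so keep shrinking
# Hypothetical: the same change costs 2.0 in model loss instead.
print(shrink_is_worthwhile(1.5, 2.0))  # False: 2.0 > 1.5, so stop
```

A larger constant makes the penalty saved larger, so more reductions pass the test.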

Let’s take a look at a couple of additional models and their unregularized loss values:

Without regularization, Model 1 would have been selected as the best of the three by our algorithm. However, now if we apply our L1 regularization with a constant value of 1:

The other two models are now judged to outperform Model 1, and Model 2 gives us the best overall L1 regularized loss. Note that in this version of the model, we no longer need to consider w2 — the weight is zero, so for future versions of the model we build out, we can drop this feature altogether.

As with L2 regularization, we can weight the regularization effect more heavily and see if and how that changes our evaluation. Say for example we set the constant to 3 instead of 1:

The more weight you give to the L1 regularization term, the better Model 3 is going to look, since it has the lowest overall sum of absolute model weights. Indeed, at any value of the constant (call it C) higher than 2, Model 3 is going to be evaluated as the lowest loss model. Even if we set C to 2, so that the L1 regularized losses of Models 2 and 3 were equal, Model 3 would still be considered the superior model because it allows us to ignore both w1 and w2.
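To make the comparison concrete, here is a small sketch that scores three candidate models at different values of the constant. Model 1 uses the weights and loss from the example above. The weights and losses for Models 2 and 3 are hypothetical stand-ins that I chose only to match the behavior described (Model 2 wins at C = 1, the two tie at C = 2, Model 3 wins above that); they are not the original figures.

```python
def l1_regularized_loss(weights, model_loss, constant):
    return constant * sum(abs(w) for w in weights) + model_loss


# name: (weights [w0, w1, w2], unregularized model loss)
candidates = {
    "Model 1": ([10, -4, 1.5], 56),  # from the worked example
    "Model 2": ([11, -1.5, 0], 57),  # hypothetical: w2 dropped to 0
    "Model 3": ([11, 0, 0], 60),     # hypothetical: w1 and w2 dropped to 0
}


def rank_models(constant):
    """Return (name, L1 regularized loss) pairs, lowest loss first."""
    scored = {
        name: l1_regularized_loss(weights, loss, constant)
        for name, (weights, loss) in candidates.items()
    }
    return sorted(scored.items(), key=lambda item: item[1])


for constant in (1, 2, 3):
    print(constant, rank_models(constant))
# C = 1: Model 2 (69.5) < Model 3 (71.0) < Model 1 (71.5)
# C = 2: Model 2 and Model 3 tie at 82, Model 1 is 87
# C = 3: Model 3 (93) < Model 2 (94.5) < Model 1 (102.5)
```

With a tie at C = 2, the loss alone cannot pick a winner, which is where the preference for the model with fewer non-zero weights comes in.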

We could generate a final version of the model that only contained w0 and base all of our test predictions on that one feature. We no longer need to consider w1 and w2, as long as the L1 regularization assumptions being made were accurate to begin with.

Practically speaking, you would almost never use L1 regularization on a model with only three features. In small data situations like this, you’re probably looking for data addition and enrichment opportunities rather than feature rationalization. However, let’s say your data contained three thousand features instead of just three. Depending on your modeling approach, this can drive model training and serving time up fairly dramatically, potentially to unworkable levels. In this situation, creating a baseline model and using L1 regularization to identify features whose weights can be dropped to 0 or close to it would be extraordinarily valuable, as the compute time to model, say, 300–400 features can be far less than the time needed for 3,000 or more. Even if you give up a little in the way of predictive performance, you’ve gained a lot in the flexibility and clock-time performance of the models on smaller datasets.
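Once a baseline model has been fit with L1 regularization, picking the surviving features can be as simple as filtering on weight magnitude. This is a minimal sketch: the function `select_features`, the `tolerance` cutoff, and the example weights are all assumptions made for illustration, and the right cutoff for "close to 0" depends on your data and feature scaling.

```python
def select_features(weights_by_feature, tolerance=1e-6):
    """Return the names of features whose weight magnitude exceeds tolerance."""
    return [
        name
        for name, weight in weights_by_feature.items()
        if abs(weight) > tolerance
    ]


# Hypothetical fitted weights after L1 regularization
fitted_weights = {"w0": 11.0, "w1": 0.0, "w2": 1e-7}
print(select_features(fitted_weights))  # ['w0']
```

Future versions of the model would then be trained only on the features returned here.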

# Elastic Net

Elastic Net combines L1 and L2 regularization. It computes both penalties using the same procedures as before and then weights each one according to the relative weight you want to give L1 vs. L2 regularization. (The two weights will sum to 100%.) You could write it out like this:

(L1 percent) * ((Sum of Absolute Weights) * the Constant) +
(1 - L1 percent) * ((Sum of Squared Weights) * the Constant) +
Model Loss

So if you wanted to try building a model with a 50/50 split between L1 and L2 regularization, you’d set the L1 percent value to 0.5. If you wanted to do mostly L1 regularization but throw in just a little L2 regularization effect, you might set the L1 percent value to 0.9. And so on.
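The formula above translates fairly directly into code. This is a sketch of the formula as written in this post, with function and parameter names of my own choosing; library implementations may scale the individual terms somewhat differently, so check the documentation before comparing values.

```python
def elastic_net_loss(weights, model_loss, constant, l1_percent):
    """Blend the L1 and L2 penalties, then add the model loss."""
    l1_penalty = constant * sum(abs(w) for w in weights)
    l2_penalty = constant * sum(w * w for w in weights)
    return l1_percent * l1_penalty + (1 - l1_percent) * l2_penalty + model_loss


# Same three-feature model as before: w0 = 10, w1 = -4, w2 = 1.5, loss = 56
weights = [10, -4, 1.5]
print(elastic_net_loss(weights, 56, constant=1, l1_percent=0.5))  # 122.875
print(elastic_net_loss(weights, 56, constant=1, l1_percent=1.0))  # 71.5 (pure L1)
print(elastic_net_loss(weights, 56, constant=1, l1_percent=0.0))  # 174.25 (pure L2)
```

Setting the L1 percent to 1 or 0 recovers plain L1 or plain L2 regularization, which is a handy sanity check.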

In the end, the best values and weights might be guessable based on your knowledge of the dataset, but you might also need to try various combinations via grid search or some derivative of it in order to find the optimal values.
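A basic grid search over the constant and the L1 percent might look like the sketch below. The `validation_loss` function here is a toy stand-in; in practice it would train a model with the given settings and return its loss on held-out data. All names and candidate values are assumptions for illustration.

```python
from itertools import product


def grid_search(constants, l1_percents, validation_loss):
    """Try every (constant, L1 percent) pair and return the best one."""
    best = None
    for constant, l1_percent in product(constants, l1_percents):
        loss = validation_loss(constant, l1_percent)
        if best is None or loss < best[0]:
            best = (loss, constant, l1_percent)
    return best


def toy_validation_loss(constant, l1_percent):
    # Toy stand-in: pretends the ideal settings are constant=1, l1_percent=0.5.
    return (constant - 1) ** 2 + (l1_percent - 0.5) ** 2


best_loss, best_constant, best_l1_percent = grid_search(
    constants=[0.1, 1, 10],
    l1_percents=[0.1, 0.5, 0.9],
    validation_loss=toy_validation_loss,
)
print(best_constant, best_l1_percent)  # 1 0.5
```

The cost grows with the number of combinations, so a coarse grid followed by a finer one around the best result is a common compromise.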

# Summary

In summary, whereas L2 regularization tends to smooth out the magnitudes of the feature weights in your model, L1 regularization goes in the opposite direction and tries to minimize the number of features that have to be considered at all, even if it means increasing the magnitude of already large weights. The right choice between the two depends on the nature of the data you are working with. In situations where both might be useful to a greater or lesser degree, you can use Elastic Net and adjust the relative weighting of the regularization penalties until you find the optimal solution.

Regardless of the regularization approach, your end goal is to produce models that perform as well as possible in the real world. For L2 regularization, this usually means better overall predictions. While this might be true for L1 regularization as well, the focus is usually on eliminating extraneous features from the data when you have too many of them for the compute resources you have available, thus allowing you to build and iterate future models on smaller subsets of data.

As with everything in data science, a great deal of hypothesis building and experimentation is required in order to come up with the optimal values and approach for a given data problem, and in some cases you may find your data reflects reality well enough that you don’t need to use regularization at all! The larger and more representative your data are, the less benefit regularization will typically provide. The only way to know for sure in most cases is trial and error.

Data Science & Cloud nerd with a passion for making complex topics easier to understand. All writings and associated errors are my own doing, not work-related.