Two Steps Forward, Two Steps Back
Repeat Until Convergence (Neural Networks and Back Propagation)
In this post, I’ll start with a high-level review of what we’ve learned so far with neural networks and how they work up through a complete forward pass, and then conceptually walk through the back propagation technique to use gradient descent and adjust the randomized weight and bias values to align predictions more closely to actual labels. We will uncover some really neat math effects of using the ReLU activation function, and find out how the chain rule is applied to make finding the gradients across all of the layers a relatively low-effort process from a compute standpoint. This one is a little heavier than my normal post, by necessity, so brace yourselves! That said, hopefully this will help you understand how one of the more complicated concepts in machine learning today works in real-world applications.
Neural Network Review
It starts with the Perceptron, which was the first machine learning algorithm developed and can be used to create a function that separated two classes, assuming a hyperplane exists that can do so.
However, not all data distributions can be so neatly divided. Say, for example, this is your distribution of red dots vs. blue dots:
A standard linear classifier is not going to find a single line/hyperplane that will accurately separate the two classes. When we looked at Support Vector Machines, we learned that feature expansion might be able to come to the rescue here — add one or more dimensions that do some type of interactions to make the classes seperable in a higher-dimensional space — but the problem is we have no idea apriori (before we start) what kind of feature expansion trick might work, so we will probably end up doing a lot of trial and error and burning through a lot of time and compute power, and in the end might not be successful if we don’t happen to guess the right types of expansion to try.
Ugh.
What would be ideal in this situation would be to employ two (or more) linear classifiers and combine them together to produce an outcome like this:
This turns out to be the basis for neural networks, a.k.a. Multilayer Perceptrons. However, it’s not quite as simple as just combining a couple of linear classifiers. In the visual above, we are using two of them. However, if you were to take those two separate classifiers:
…and add them together, you don’t get the nice V shape in the diagram. You actually get the summation / average of the two hyperplanes, so the result would actually be this:
In order to get the basic Perceptron to produce shapes that aren’t simple hyperplanes, we employ a “hidden layer” which utilizes an activation function to transform the simple linear classifier into a more interesting shape. There are several types of activation functions, but the most popular one, by far, is called ReLU (Rectified Linear Unit). ReLU basically takes the results of your randomized wx+b calculations, sets y = x for all positive values, and sets all negative values to 0. This produces a shape with a kink, or hinge.
Neural networks create multiples of these using random weights and bias values in the beginning (including negative values, so you get both positive and negative slopes).
And now when you add these functions together, you get something that looks distinctly different from a standard linear classifier:
As we previously explored, you can get ever more complex shapes to be produced by either adding more neurons to the network (adding width), or by adding additional hidden layers (adding depth, or “deep learning”) like in the example below:
The final result of our initial forward pass through the neural network is a really, really terrible prediction function. Depending on the random initial weights and bias values selected, you might get something like this:
However, this sets us up to use gradient descent to adjust our weights and bias values and start shaping this function through successive iterations (a process known as back propagation) until we have minimized our loss to the degree possible.
Now that we understand the goal, it’s time to figure out how we get there.
Back Propagation
Back propagation is a form of gradient descent, just like we used with logistic regression and gradient boosted regression trees, so conceptually this should be familiar. We calculate our loss (we’ll use sum of the square residuals for our example) and then adjust our weights and bias values to reduce and eventually minimize the loss over successive iterations.
Where this gets tricky is the dependency we have between our layers. The gradient (or first derivative) of our loss with respect to a given weight (which determines the size of our weight adjustment) is relatively simple to calculate at the last layer, however at each previous layer you have to take all successive outputs into consideration.
The Last Shall Be First
Since this is “back” propagation, the first layer we evaluate is actually the last layer we computed in the forward pass. I’ll refer to the weights and bias values at this layer as WL. The preceding layer will be WL-1, then the one preceding that WL-2, and so on.
At the first layer we are calculating the derivative of our loss (sum of squared residuals, or SSR) with respect to our final weight, WL. This is another way of saying: if WL changes, what is the impact on the SSR? Put one more way — what is the slope of the line if WL is on our x-axis and SSR was our y-axis?
If you’ll remember when we looked at logistic regression, we were wanting to modify our weight and bias values until aggregate slope approximated 0 — meaning no further improvements to SSR could be made. The goal here is the same, only we have to do it for a chained combination of weights rather than a single weight. Fortunately, we can start from a single weight — WL — and work our way backwards.
In order to calculate this gradient, we can use the chain rule. It turns out the derivative of SSR with respect to WL is equal to the product of two other derivatives: the derivative of SSR with respect to our predicted values (a.k.a. our loss function) and the derivative of our predicted values at our activation function with respect to WL.
We are using square loss as our loss function, which is simply the square of our actual label minus our predicted label. The gradient/derivative of this function is -2*(actual — predicted). [If you’re not sure how to compute basic derivatives like this, there are a plethora of great online tutorials like this one that can show you the ropes.]
The second part of the equation turns out to be simple with ReLU (which is a theme you will see a couple more times in this blog post…) Since no fancy formulas are employed to calculate the predicted value in the last layer, the derivative of predicted values with respect to WL is equal to the predicted values from the previous layer, or the activation layer output at L-1.
The WL Update
At this point we have what we need to employ Taylor’s Expansion. We simply multiply these values by each other, then take that output times a pre-selected learning rate (Alpha). Finally, we subtract that value from WL to come up with the new WL we will use in the next forward pass.
The Alpha value here is also known as step size or learning rate, and is usually some value less than 1 so that a small-ish step is made towards the minimum. We’ve discussed Taylor’s expansion before, so I’ll refer you to that post for additional details / step size strategies that can be employed.
WL-1 Examination
**Note: The explanation below has been updated to fix an error in the original version.
Now that we’ve calculated the weight update for the first (or last, depending on how you look at it) layer, it’s time to move up to the layer before. At this layer, we need to calculate the derivative of SSR with respect to WL-1. Using the chain rule once again, we calculate this as follows:
Let’s take a look at each one of these independently:
Remember that ReLU transforms negative values to 0, and postive values y = x * the weight at that layer. Therefore, the derivative with respect to X will either be 0 for any negative values or WL-1 for positive values.
Technically, the math falls apart at x = 0, since you can’t divide by 0. However, practically this isn’t a problem. You are either going to get a 0 or WL-1 derivative response from the activation function, so you can choose either of those and apply it when x = 0. It really depends on the nature of your problem and data, but what I’ve seen so far is a default where the response is 0 for all values 0 and below, and WL-1 for all values greater than zero.
The Beautiful Simplicity of ReLU — Part 1
I want to specifically call out here what the role of ReLU is at this stage in this back propagation process: It serves as an on-off switch. If you have a positive value for your input at this step (which is wx+b from the previous step), then the derivative is going to be WL-1, and an update will occur. If, however, you have a negative value for your input, the derivative is going to be 0, which means **no WL-1 update will occur** in this iteration. Furthermore, this is information that can be collected at the point of input during the forward pass. Therefore, when it comes time for back propagation, all you have to compute is the derivative of the loss function with respect to the predicted values. The rest of the values are already known, and if the activation function derivative is 0, you can simply ignore / skip this step in the process during back propagation.
The WL-1 Update
Putting it all together, our update logic will look like this:
And again, if X = 0, we simply choose to use 0 or WL-1 (and commonly 0 is employed to induce sparsity.)
Now, what actual calculations need to take place as part of back propagation? Well, we’ve evaluated the ReLU output, but otherwise, every other variable in this formula is already known. WL and WL-1 were part of the forward pass, we already calculated the derivative of SSR with respect to predictions at the previous layer, and even WL times X was a value calculated during the forward pass.
Thus, the **only** thing you have to do, assuming you do anything at all, is calculate the new WL-1 value given the Taylor’s Expasion update, plugging in all of the already calculated components, which takes almost no time at all in compute terms.
Wait, That’s Really All There Is To It?
Yes. Yes it is.
WL-2 and Beyond
As you traverse layers back to the first one, executing the chain rule means you just keep adding more products to the previous output — specifically, the previous WL-? value times X. HOWEVER, because we’re going to keep multiplying here, if your previous weight update was zero, you can simply stop. Once you’ve gotten a zero in the mix from WL-1 update and beyond, you will not perform additional updates at this pass, since that zero will always be one of the products evaluated, regardless of how things evaluate upstream. In practical terms, the calculations look like this:
With the respective weight updates being WL-? - Alpha * Output. And again, if you got a 0 update at WL-1, you’re done, and you strictly speaking wouldn’t need to continue calculating updates for this pass for the subsequent weights, regardless of the number of hidden layers.
The Beautiful Simplicity of ReLU — Part 2
What this effectively means is, compared to other activation functions, ReLU might make a relatively small number of weight updates per backwards pass. However, because of the dramatically lower compute costs required to calculate the derivatives and resulting outputs (and the fact that all values for the backwards pass can be easily stored during the forward pass and at the evaluation of the weights in WL), you can perform substantially more passes with the same time and compute resources compared to more complex activation functions.
A Note About The Real World
While I have been writing thus far as though each neuron would be evaluated separately, in reality all of these calculations are going to be performed as matrix operations. Therefore, all layers would be evaluated regardless of the outcome of a previous layer because all layers across all neurons are evaluated at the same time in a single operation. While this does mean some extra maths performed compared to the independent neuron logic approach, the trade-off is far, far more than worth it in terms of performance gains by being able to simultaneously perform calculations for thousands (or millions, etc.) of neurons in a single operation.
Two Steps Forward, Two Steps Back
In summary, then, we have now completed both a forward pass through a neural network, reached the end, calculated our loss values, and then calculated the derivative of those losses with respect to the weights at each layer using the chain rule. We would then perform another forward pass, this time using the updated weights, compute our new loss values, and then perform back propogation to once again update the weight values. This process would repeat until our losses reached a small enough value to call our predictive function accurate, at which point we are done.
And that is a neural network end-to-end and back again. In the next post, we’ll tackle one of the major issues around neural networks — overfitting — and some of the regularization techniques that can be used to mitigate it.