EDIT: Marble in a Steel Bowl Part 2

8 min readOct 18, 2021

This post was significantly edited December 2024.

Gradient Descent Concepts Further Defined

So you wanted a data science post? Well, here’s your data science post. :)

When I last posted about gradient descent, I compared it conceptually to dropping a marble in a steel bowl. The bowl itself is invisible, so you don’t know where the bottom is (might be at 0 loss, probably larger). However, you do know that if the marble hits the bottom or close to it, it’s going to bounce straight back up. If not, the marble will hit the side and bounce in the direction of the center. Once you observe the direction of the bounce, you adjust your position in that direction and drop the marble again, and keep doing this until the marble bounces close enough to straight back up for you to call the center as being reached.

All in all, a pretty decent analogy.

What I want to do in this post is to conceptualize some of the math concepts that go into gradient descent and make it work, and then explain how you know which way the marble bounces (and thus, the direction you need to move to get closer to the bottom of the bowl), how you figure out how far to move, and so forth.

Let’s start with a basic question: What is this “bowl” we’re referring to, and how do we know it’s convex to begin with?

The bowl in logistic regression refers to the loss function. Your goal is to find the minimum loss possible using a linear classifier (a.k.a. a “Perceptron” model: wTx + b — not to be confused with the Perceptron algorithm, which is just one way to get to this kind of model…) You acheive this minimum loss by iteratively modifying the values for w and b such that they produce a hyperplane that gets as close as possible to the minimum loss possible.

We know the loss is function is convex because we are imposing the sigmoid (a.k.a. “logistic”) assumption on our loss values — mathematically speaking, we are not allowing anything other than a bowl-shaped loss function with a single minimum value (assuming classes are not linearly separable — if classes are linearly separable, our bowl has a flat bottom with more than one possible combination of minimum loss values, but most data are not so lucky…) If the data are linearly separable, you can acheive a zero loss and stop whenever that point is reached. If the data are not linearly separable, we continue to tune w and b values until we determine we have found the lowest loss point.

This is actually the reverse, assumption wise, from the Naive Bayes classifier. There, we imposed an assumption of normality on the distribution of our features — which is a very, very big assumption, but lacking any evidence to the contrary, it’s about the best starting place we know of in statistics. The outcome / effect of this in a two-class data problem usually results in what is effectively a linear classifier with what happens to be a sigmoidal loss function. With logistic regression, we start from that end point — imposing the sigmoidal (or “logistic”) loss function on our data, and then working backwards from there to define the classifier itself. The advantage here is if you have enough data, we are no longer assuming our features are normally distributed, and this can allow us to produce a more precise linear classifier if we have good training data and enough of it.

So back to gradient descent then — first off, gradient is another way of saying “slope,” so if you remember basic geometry and that slope equals rise over run (or change in y over change in x, in a two-dimensional space), you’ve got that basic idea of what a gradient is. In two dimensional space, it’s just a line defined by a formula wx + b (again, basic geometry here — although you probably learned it as mx+b instead of wx+b.) In matrix math / numpy notation that turns into w.T@x + b. Another way to think about gradients in terms of functions is that they are the tangent of a given point on a function’s curve, which can be calculated using first derivatives.

Thus, if we know we have a convex loss function, at any combination of w and b values we can use the direction of the first derivative results to determine the tangent / slope of those points in the convex loss function. We don’t know how far away we are from the center — not at first anyway — but we know which direction we need to adjust each w and b value in order to get closer to it.

You start by making a guess about what w and b values should be (usually you give them all 0 values before the first update — which is almost certainly not going to be the optimal solution…), figure out your loss and your gradient values, adjust in the direction of the negative gradient, and then check again.

Let’s take a look at this visually. We start with a one-dimensional data distribution (labeled x’s and o’s) that all happen to be positioned above the x axis. The two classes are separable, so zero loss is acheivable. Our initial guess with logistic regression is all zeros, so our decision boundry lies on the x axis itself.

The algorithm predicts the x class for anything above the decision boundary and o for anything below it.

Initially, then, the algorithm is predicting all of the x labeled points correctly, and all of the o’s incorrectly. This results in an error rate of 44%. The first derivative (gradients for w and b) indicates that we need to move both of them in the positive direction.

Logistic regression then adjusts w and b in the direction indicated, however the step size itself is not predetermined. For our simple illustration, we will use an initial step of 0.5, which results in a new prediction / decision boundary. This first step is a doozy, improving our error rate to only 18%.

According to our gradients of w and b, we still need to move both in the positive direction.

And so we do. Our next step will be 50% of the magnitude of the previous step, so now we draw a new decision boundary with w and b values of 0.75 (an increase of 0.25 each). However, this actually results in a higher error — meaning for one or both of our values, this was a step too far.

The gradient results here will indicate adjustments in the other direction:

…and so we adjust slightly backwards. This process repeats until zero loss is reached, your adjustments are no longer producing improvements in the loss value, or until you run out of time / money / compute for additional accuracy.

One big thing I’ve left out up to this point — how do you know how much to adjust your w and b values at each iteration? The unfortunate answer is “you don’t.” You know the direction, but there’s no way to know the magnitude of the adjustments needed. You just have to guess at how much adjustment to make. Taylor’s expansion is a process in which you make small gradient adjustments by adding the gradient of w (or more precisely, the negative gradient, since gradients by definition ascend, so for “descent” you do the opposite) times some alpha value to your weights, and do the same for your bias value.

This alpha value is kind of arbitrary, but the wrong one can have a big impact. Set it too small, and your w adjustments are very small — you are only making tiny movements towards the center of the bowl every time. Sure, you’re getting closer and closer to the bottom, but you’re moving so slowly that it might take forever for you to get there. On the other hand, set it too large, and you are making big moves in the direction of the bounce — possibly past the bottom of the bowl and over to the other side (like we saw in the example above.) Your adjustment risks bouncing first off one side, then the other, then back on the other side, and on and on without ever having a chance of actually hitting the bottom, just changing the direction of your bounce at each iteration. That’s also not good.

One way to approach this is to start with a relatively large alpha value — say, 1, for example, and then at each iteration make it smaller and smaller. That way, in the beginning you are taking pretty large steps, but as you get closer to the center, you are taking smaller and smaller steps so as not to overshoot too far until you get to a minimum step size that makes sense. The adjustment for offset b at each iteration follows a similar pattern.

And that, folks, comprises the theory behind gradient descent. Make an assumption about the nature of your loss function, then use that assumption and find the slope of the resulting label distribution such that you can approximate the bottom of the convex shape it makes.

EDIT: Marble in a Steel Bowl Part 2

Written by Jason Eden

No responses yet