Descent into Madness using Logistic Regression (My 4th Machine Learning Algorithm)
In the Linear Classifiers course in the Cornell Machine Learning Certificate program, you end up implementing two classification machine learning algorithms: the Perceptron (which I discussed in my previous blog) and Logistic Regression.
Logistic Regression turns out to be a confusing name, or it was for me when I first heard it, because a “regression” algorithm in machine learning is typically **not** a classifier (0 or 1), but rather an algorithm designed to predict a specific value; a commonly used example is the sale price of a home given its characteristics. In fact, the similarly named Linear Regression, which **is** a regression ML model, was also discussed and demonstrated in the same class.
We didn’t actually implement Linear Regression in the lab ourselves. It turns out that solving a Linear Regression problem with matrix multiplication has what is referred to as a “closed form” solution, and in this case an entire class of problems can be solved with a single line of code. This code finds the hyperplane that results in the lowest overall error across the points in the data set, and while doing the math by hand would be tricky and take a while, it’s super quick and easy for a computer. The course walks through a basic Linear Regression demo, but I guess the implementation is easy enough that they didn’t think it was worthwhile as a graded exercise. I tend to agree with that decision.
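A minimal sketch of what that one-liner looks like in NumPy, using a made-up toy dataset (the numbers here are purely illustrative, not from the course):

```python
import numpy as np

# Hypothetical toy data: square footage (in 1000s) -> sale price (in $1000s)
X = np.array([[1.0, 1.0],
              [1.0, 1.5],
              [1.0, 2.0]])  # the leading 1s absorb the intercept term
y = np.array([200.0, 300.0, 400.0])

# The "one line": the closed-form (normal equation) solution to least squares
w = np.linalg.solve(X.T @ X, X.T @ y)

print(w)  # [intercept, price per 1000 sq ft]
```

No loops, no training iterations: the best-fit hyperplane pops out of a single linear-algebra call, which is exactly why it makes a quick demo rather than a lab.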
While Linear Regression will spit out a best guess for a specific answer to a question, Logistic Regression returns either a yes or no, 1 or 0, positive or negative response. By this time in the certificate series, you’ve already implemented two other classification algorithms: the Perceptron and Naive Bayes. Each of them is useful in different scenarios: Naive Bayes when you don’t have a lot of data, your data are low-dimensional, and you can get away with making assumptions about the distribution of the data; and the Perceptron when you have lots of data, the data are high-dimensional, **AND** a hyperplane exists that can cleanly separate the two classes. So why a third one?
Logistic Regression, it turns out, solves problems that the Perceptron cannot, and can use lots of actual data without having to make assumptions about its distribution, meaning it can be more accurate than Naive Bayes. The second part makes a lot of sense to me: predictions based on actual data should be more accurate than predictions based on assumptions about data, so if you have it, use it! But how Logistic Regression solves the Perceptron’s problem turned out to be the crux of the class from my point of view.
In the real world, data are noisy, and there are likely lots and lots of data sets where points with different labels end up intermingling to some degree or another. Since the Perceptron needs to be able to get the labels right 100% of the time on the training data, this makes it impractical for a significant swath of real-world machine learning challenges. Given the simplicity of the approach, there’s simply no way to produce a reliable answer if a definitive separating hyperplane doesn’t exist.
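One concrete way to see the problem, sketched with made-up numbers: if the exact same feature values ever show up with both labels (which happens constantly in noisy real-world data), then no hyperplane can possibly classify every training point correctly, because identical points always land on the same side of any boundary.

```python
import numpy as np

# Hypothetical noisy data: two identical houses, one sold ("1") and one didn't ("0")
X = np.array([[2.0, 3.0],
              [2.0, 3.0],
              [5.0, 1.0]])
y = np.array([1, 0, 1])

# Any hyperplane w·x + b puts the two identical rows on the SAME side,
# so at least one of their conflicting labels must come out wrong.
w, b = np.array([1.0, -1.0]), 0.0  # an arbitrary example hyperplane
predictions = (X @ w + b > 0).astype(int)
accuracy = (predictions == y).mean()
print(accuracy)  # stuck below 1.0 no matter which w and b we try
```

The Perceptron’s update loop assumes it will eventually stop making mistakes, so on data like this it just never converges.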
Logistic Regression ends up solving this problem in part by using gradient descent to approximate, as closely as possible, a separating hyperplane rather than precisely define one. I won’t go into the heavy math here, but basically I equate the solution to dropping a marble in a steel bowl and seeing which direction it bounces. The math in Logistic Regression results in a loss function that is convex, which is handy because we know right away there is a single minimum point (there’s only one point making up the bottom of the bowl). We can, as usual, start with a guess at that minimum, once again using a vector of 0’s as our initial hyperplane, just like with the Perceptron. We “drop the marble” and learn, based on some algebra (yay!) and calculus (ugh! 🤓), which direction leads toward the minimum. We use this information to adjust our hyperplane guess, then drop the marble again a little lower in the bowl and see how it bounces (i.e., how much more accurate are we now?). This continues for some number of iterations, or until our loss (the measure of inaccuracies in our predictions on the training data) gets “close enough” to the actual minimum (which may or may not be reachable with exact precision) to call our model good. Or, to finish the marble analogy, we continue until the marble bounces close enough to straight back up to satisfy our requirements.
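A rough sketch of that loop in NumPy (the dataset, learning rate, and iteration count are all made up for illustration, and the labels are deliberately noisy so no perfect separator exists):

```python
import numpy as np

def sigmoid(z):
    # Squashes any score into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature dataset; note the noisy labels around the middle
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])  # leading 1s for the intercept
y = np.array([0, 0, 1, 0, 1, 1])  # not cleanly separable: a Perceptron would never converge

w = np.zeros(X.shape[1])  # start with the all-zeros guess, just like the Perceptron
learning_rate = 0.1       # how far the "marble" moves on each bounce

for _ in range(5000):
    p = sigmoid(X @ w)                 # current predicted probabilities
    gradient = X.T @ (p - y) / len(y)  # which way is uphill on the convex loss bowl
    w -= learning_rate * gradient      # step downhill toward the bottom

predictions = (sigmoid(X @ w) >= 0.5).astype(int)
print(w, predictions)
```

In practice a library like scikit-learn’s `LogisticRegression` handles all of this for you, but the loop above is the gist of what happens behind the scenes: the model never achieves (or needs) 100% training accuracy, it just settles as close to the bottom of the bowl as the noisy data allows.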
This bypasses the Perceptron problem because it does not require 100% accuracy — meaning we can generate a powerful predictive model even with noise in the data set. And in situations where you have lots of data, the predictions should be substantially more accurate than what you would get using Naive Bayes. While this was the most complicated algorithm to implement so far, the power and the value proposition of Logistic Regression for classification problems was evident, and now that I understand what’s going on behind the scenes, I feel a lot more confident about using it to solve real-world problems.