Transforming Capabilities By Writing My Third Machine Learning Algorithm
There are so many Transformers® jokes made possible just by the name of this ML algorithm alone that I’m struggling to stay on task.
I’m really, really struggling.
I have an earworm that goes something like “Autobots wage their battle to destroy the evil forces of… the Perceptetrons.” For those of you who aren’t familiar, that’s a riff on the 1980s theme song that, to me, still defines the franchise.
OK, trying to focus now. Foooocuuuussssss…… The Perceptron: not an evil robot race bent on world domination, but rather one of the first (if not the absolute first) machine learning algorithms ever invented. The Perceptron is a linear classifier like Naive Bayes, so we are going to label something as either one thing or the other (spam vs. not-spam, for example). However, unlike Naive Bayes, the Perceptron doesn’t assume any probability distribution for our data. It adjusts its decision boundary directly from the labeled training points, one mistake at a time. Let’s take a look at how a really, really basic version of this works.
Assume you have a data set that contains two two-dimensional data points: (4,6), which is labeled +1, and (-2,5), which is labeled -1. We are looking for a hyperplane that separates these two points. We describe that hyperplane by a vector (the direction perpendicular to it), and from here on I’ll just call that vector “the hyperplane.” What we want is for the sign of the dot product of that vector with each point to match the point’s label: in other words, the dot product of our hyperplane with a positively labeled point results in a positive value, and the dot product with a negatively labeled point results in a negative value. You can start with a guess from anywhere, but by convention you start with a vector whose components are all 0. So for our 2-dimensional data, we’ll start with (0,0).
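To make the “sign of the dot product” idea concrete, here is a minimal Python sketch. The helper names `dot` and `classify` are my own, not from any library, and the vector (1, 0) is an arbitrary pick purely for illustration.

```python
def dot(w, x):
    """Dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, x):
    """Return +1, -1, or 0 according to the sign of w . x."""
    score = dot(w, x)
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0  # exactly on the hyperplane: neither label

points = [((4, 6), 1), ((-2, 5), -1)]

# The conventional all-zero starting vector gives 0 for every point,
# so it cannot label anything correctly.
for x, label in points:
    print(x, "wanted", label, "got", classify((0, 0), x))

# An arbitrary illustrative vector, (1, 0), just to show both signs.
for x, label in points:
    print(x, "wanted", label, "got", classify((1, 0), x))
```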
We will iterate through our data set one point at a time and see whether the dot product gives us the right type of value. Our first “guess” is guaranteed to mislabel every data point (the dot product of anything with (0,0) is 0, which is neither positive nor negative), so right off the bat a correction will need to be made. In order to adjust the hyperplane, you add or subtract the mislabeled point itself: add it if its label is positive, subtract it if its label is negative. Our first data point is (4,6), which has a positive label, so we add it to our vector, which gives us (0+4, 0+6), or (4,6).
Now we take the adjusted vector and check it against the next data point we come across. The dot product of (4,6) and (-2,5) is (4)(-2) + (6)(5) = -8 + 30 = 22, a positive number. We were looking for a negative number this time, so this point has been mislabeled. Time to adjust the hyperplane again.
Since the mislabeled point was negative, we **subtract** it from our vector. The resulting math, (4-(-2), 6-5), gives us a new hyperplane of (6,1).
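The two corrections so far boil down to one rule: multiply the mislabeled point by its label and add the result to the vector. Here is a small sketch of just that step (the function name `update` is my own invention):

```python
def update(w, x, label):
    """Add x to w for a +1 label, subtract it for a -1 label."""
    return tuple(wi + label * xi for wi, xi in zip(w, x))

w = (0, 0)
w = update(w, (4, 6), 1)    # (0+4, 0+6)
print(w)                    # (4, 6)
w = update(w, (-2, 5), -1)  # (4-(-2), 6-5)
print(w)                    # (6, 1)
```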
Now we have reached the end of our two-value data set, but the Perceptron won’t stop until its final hyperplane produces the correct answer 100% of the time against every data point. Therefore, we need to go through our data set again and make sure the dot product of this hyperplane with every point gives the correct label. Here’s how the math worked on the second iteration: (6,1) with (4,6) gives 24 + 6 = 30, which is positive and matches the +1 label; (6,1) with (-2,5) gives -12 + 5 = -7, which is negative and matches the -1 label.
And with that, we’re done. We have defined a hyperplane, (6,1), whose dot product by each observation results in the correct label for every point in our training data set. (Note that there are many, many hyperplanes that actually separate those two points, and (6,1) may not be the optimal one. The Perceptron isn’t looking for optimal. It’s just looking for the first one it can find that works.)
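Putting the whole walkthrough together, the loop might look something like the sketch below. This is my own bare-bones version, not a reference implementation; the name `train_perceptron` and the `max_passes` safety cap are assumptions I added so the loop cannot run forever.

```python
def train_perceptron(data, max_passes=1000):
    """data is a list of (point, label) pairs with label +1 or -1.
    Returns (weights, number_of_passes_used)."""
    w = [0] * len(data[0][0])  # conventional all-zero start
    for n_pass in range(1, max_passes + 1):
        mistakes = 0
        for x, label in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            # A zero score counts as a mistake: it is neither sign.
            if label * score <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full clean pass: we are done
            return w, n_pass
    raise RuntimeError("no separating vector found within max_passes")

weights, passes = train_perceptron([((4, 6), 1), ((-2, 5), -1)])
print(weights, passes)  # [6, 1] 2
```

The first pass makes the two corrections described above, and the second pass is the clean check that confirms (6,1) labels both points correctly.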
It can get a lot more complicated than this. For example, imagine 100 dimensions, or 100,000 dimensions (not math I want to do by hand…). In some cases you also have to consider an offset, or b value, that gets added to or subtracted from your dot product and can turn an otherwise positive dot product result negative (or vice versa). But at a base level, this is all the Perceptron really does.
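Here is one hedged sketch of how the offset can be folded in: treat b like one more number that gets the same add-or-subtract correction, using the label itself as the adjustment. That is a common convention rather than something covered in the walkthrough above, and the data points are hypothetical ones I picked because they sit on the same line through the origin, so no vector can give them opposite signs without an offset.

```python
def train_with_offset(data, max_passes=1000):
    """Perceptron with an offset b; labels must be +1 or -1.
    Returns (w, b)."""
    w = [0] * len(data[0][0])
    b = 0
    for _ in range(max_passes):
        mistakes = 0
        for x, label in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label  # same add/subtract rule, applied to the offset
                mistakes += 1
        if mistakes == 0:
            return w, b
    raise RuntimeError("gave up after max_passes")

# Hypothetical data: (2,2) is just (1,1) doubled, so w . x alone
# always has the same sign for both points.
data = [((1, 1), 1), ((2, 2), -1)]
w, b = train_with_offset(data)
for x, label in data:
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    print(x, "label", label, "score", score)
```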
There are a couple of caveats, but the biggest one is this: if a hyperplane that separates the labeled training data 100% of the time doesn’t actually exist, then the Perceptron never finishes. It just keeps cycling through the data, correcting one point and breaking another. As they explain in the class, this turns out not to be a problem in really high-dimensional space, but the smaller the number of dimensions, the more likely it becomes. In our example, I only used two data points, so I knew with 100% certainty that finding a hyperplane to separate them would be possible. But if there were 100 2-dimensional data points, and some of the negatively labeled data were mixed in with positively labeled data? The Perceptron might never reach a conclusion.
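You can watch this failure happen with a tiny made-up data set. In the sketch below, the three points are hypothetical: a negative point sits directly between two positive ones on the same line through the origin, so no vector can get all three right. The `max_passes` cap and the `converged` flag are my own additions, since the plain algorithm would simply loop forever here.

```python
def try_perceptron(data, max_passes=100):
    """Returns (w, converged). converged is False if every pass
    up to max_passes still contained at least one mistake."""
    w = [0] * len(data[0][0])
    for _ in range(max_passes):
        mistakes = 0
        for x, label in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if label * score <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return w, True
    return w, False  # ran out of passes without a clean one

# Hypothetical mixed-up data: a negative point between two positives.
mixed = [((1, 1), 1), ((2, 2), -1), ((3, 3), 1)]
w, converged = try_perceptron(mixed)
print(converged)  # False
```

Raising `max_passes` would not help in this case; the cap only decides how long you are willing to wait before giving up.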
While the Perceptron really isn’t used much in modern data science, many of the principles it employed formed the foundation for a lot of the machine learning algorithms in use today. It was a revolution and well deserves its place in the AI/ML Hall of Fame. And I’m not just saying that because I’m afraid that my car might actually be an evil robot in disguise… honest!