Writing My First Machine Learning Algorithm from Scratch
(Note: My next posts on both R and Python for Machine Learning are going to entail work I’m doing for my final project for both of my Master’s Degree classes. As such, I’m going to delay publishing the results on my blog until after the classes are over.)
There’s a relatively well-known scene in Monty Python and the Holy Grail where the group of knights comes across a rabbit’s lair. The cave is surrounded by bones, and their guide warns them that the “cute little bunny” is actually a sadistic killer. The group believes they are more than capable of handling the threat, and so they send one of their men in — and he is quickly killed by the murderous rabbit. The group, now believing the threat but still disregarding the warnings, decides to attack in full numbers. It ends predictably:
On a related note, I decided to launch into Cornell University’s online Machine Learning Certificate program. On the surface, it looked like a nice complement to my Master’s in Health Data Science Program — a certificate focused more specifically on ML algorithms and theory, whereas the Master’s program is bent more towards practical application. It seemed like a perfect complement. (And on that note, I wasn’t wrong.)
As part of the enrollment and intake process, you are required to take a pre-test. I was expecting a bunch of questions about machine learning algorithms, but instead was presented with some difficult (for me) questions about matrix multiplication and calculating derivatives, concepts that I haven’t thought through for more than 20 years. The pre-test itself took me a couple of hours to complete. It was open book — I was free to look up resources online, and did so quite a bit in order to knock the rust off of some of those skills. Having completed it, I made the assumption that this was a gating mechanism, and that the actual courses would probably be simpler than the pre-test implied.
What a cute little bunny.
It turned out that the math concepts — and in particular for the first course, all of the matrix multiplication — were not only heavily used, but critical to understand in order to implement the final project for the course. In it, we were required to build a function for computing Euclidean distance across matrices, and then use that function (or an equivalent) to create the k-Nearest Neighbors (k-NN) algorithm. From scratch. And **THEN** implement **THAT** as part of a face recognition system.
All kidding aside, the material moves very quickly and expects you to have a decent Python programming background from the get-go. This is not for the feint of heart. However, assuming you get through without too much handholding from the course facilitator, by the end of it you understand at a fairly deep level how the k-NN algorithm works, as well as its relative strengths and weaknesses. This intrigued me, as it was a lot deeper and more narrowly focused than what I was expecting, but in very interesting ways that will certainly prove to have value as I go deeper into the world of data science.
There are a number of videos on YouTube that go into detail on how the k-NN algorithm works, so I’ll just give a very high-level summary: In essence, you take a bunch of labeled data points (for example, pictures of things — trees, rocks, people, etc.) which make up your training data and you decompose them into some kind of numeric value (probably a matrix). Then you take a new picture that has no label, decompose it into a numeric value, and then mathematically determine which points in your training data are most similar to the unlabeled picture. The k in k-NN represents the number of closest data points you are looking for, and is usually an odd number to try to avoid ties. The “nearest” neighbor in our class project was determined by Euclidean distance (i.e. closest in a straight line, or “as the bird flies”) however in some cases you have to determine distance based on available paths (for example, the nearest restaurant from a Euclidean perspective may be three blocks away, but the wrong way on a one-way street, so the restaurant that is five blocks away but in the right direction is actually “nearer” by that perspective), which can get more complex. Once the k number of closest points in the training data have been identified, you simply guess/assign the label of the test point to be the same as the label that had the most closest points.
As an example, let’s assume I’m uploading a picture of me and I set the k value to three. If that picture was mathematically closest to three labeled points, one of them a person and two of them rocks (let’s assume it’s not a great picture and just stop the conversation there…), that picture would be labeled as a rock — even if the actual closest match was the person picture. However, if I set the k value to one, the result would change and the picture of me would now be labeled person.
In reality, you usually end up with a pretty clear result. My rock example would be a very rare exception, and one that could be fixed most likely by increasing (rather than decreasing) the k value. Let’s say the two nearest rock pictures happen to be statues of people, thus the confusion in our data. If we set our k to a higher number — say, nine for example — then it’s likely that seven of nine would be labeled person, and thus the algorithm would provide the correct answer.
There’s a lot that goes into this on the back-end, and I thought the course (and our facilitator) did a really good job of trying to tie a bunch of complex topics into digestible chunks that could be consumed over a two-week time period. While it was challenging, it was also a lot of fun, so much so that I went ahead and signed up for the next two courses. I don’t know what I’m in for on the math side, but I expect it to be both challenging and fun! Then again, my nerd definition of fun may be different than yours, so buyer beware…