Writing My Second Machine Learning Algorithm Using Naive Bayes
Making assumptions has a bad reputation. You know what assuming does, right? (For the record, I narrowly decided against making the popular answer to that question the title of this blog post…) A significant amount of energy has been poured into self-help books and videos to try to show us how making assumptions is detrimental to ourselves and to society in general. I could on a different blog channel go on and on about how profiling (which I’m defining here as making assumptions about persons based on race, gender, appearance, etc. thus introducing bias into everything from police work to job interviews) is a blight on our society and needs to be mitigated. Assumptions have the potential to do a lot of harm, provide significant misinformation, and in the field of machine learning, assumptions baked into data have the potential to introduce significant and harmful bias. However, it turns out in some cases — and in particular, cases where data is limited but relatively complex — making assumptions is the only practical way to derive value from data and find what we’re looking for.
I just finished working through the second class out of seven for the Cornell University Machine Learning Certificate. This one was also very interesting, and I thought very different from (and maybe a little easier than) the first one. It switched from a focus on pure math / linear algebra to more of a statistics focus, which is something I’ve conceptually stayed closer to over the years. It opened up with a discussion on Bayes Classification, in which you build a predictive model based on the labels of previously identified data.
As a simple example, say we have collected data on people who buy clothes. Our store sells two shoe colors — red and brown, and we want to know what color of shoes a particular customer is going to buy. Our data indicates that a high percentage of the time, people who buy socks and t-shirts in a previous visit end up purchasing red shoes on their next one and people who buy socks and hats end up purchasing brown shoes. We could use this data, find out that a new person had just purchased a pair of socks and a t-shirt, and then use Bayes to predict what that person would buy next.
Simple enough when you have a small number of predictive features you’re working with, but what happens as you increase the number of predictive features? Say instead of two or three features, we have 30 we are trying to track. One of the things Bayes acknowledges is that features are interconnected, or to some degree dependent on one another, and those interdependencies can complicate the math quickly as the number of features increases. For example, let’s say we also care about the color of the items purchased beyond shoes. You would need to record that the customer purchased socks, and their color, as well as what specific t-shirt (if any) was purchased based on the colors available. Each item is a broad category, and each color option within each item is a unique data point value.
To use Bayes to make predictions on what specific type of thing you might buy next, you look for exact matches in the data — people who bought exactly what you bought and nothing more, nothing less. Finding an exact match across 30 data points would be significantly harder than when we were just looking at two or three, but on the plus side if you had enough exact matches it would also make better predictions.
However, what if the next person you evaluate has a wardrobe profile that does not match exactly with any other person in your data? Then standard Bayes simply will not predict what that person will buy next. Even if this person matches 29 out of 30 data points for others in the data set, Bayes cannot mathematically predict what they are likely to do, since it’s never seen that exact combination before.
Again, this might not a huge problem if you’re just looking at 30 data points and have thousands of observations of customer data to build from. You’re likely to find matches in that case. But what if instead we are looking at 300 data points? Or 3,000? Or 3 million? And what if instead of thousands or millions of observations, we just had a few hundred? Your odds of finding any exact matches now becomes vanishingly small. In those cases, and it turns out quite often in the real world, standard Bayes turns out to be completely unusable.
Enter the Naive Bayes Assumption.
What if, instead of looking at data as 300 (or 3,000, etc.) dependent features and looking for exact combinations of interdependent features, we instead made the assumption — a *naive* assumption, to be sure — that all of our data points were completely independent? If that were true, then we could calculate the probability of each feature independently, and then get the aggregate probability by multiplication (chain rule). We take our problem from the nearly impossible to solve realm of a single 300+ feature prediction and turn it into 300+ one-feature predictions, which are each really simple and relatively quick, and then take the product of them all to get the answer. Genius, right?
Only in this case, we know our assumption is almost always wrong to some degree or another. For example, take a seven-character name. According to the Naive Bayes Assumption (and specifically, the Gaussian — or normally distributed — Naive Bayes Assumption), stringing together the letters “aaaaaaa” to constitute a name would be just as likely as “annette.” In this case, the features are assumed to be completely independent, and thus the prior letter in a sequence has no bearing on what the next letter will be. We know this is not true, that there are rules and conventions, and that by making this assumption… Well yes, maybe we’ve made the math doable, and now we can start generating predictions with even a small amount of data. But we are knowingly introducing error into the equation.
The question, then, becomes *how much* error are we introducing? How dependent are the features, and to what degree does this dependence matter in terms of the training labels and predicting an outcome?
While nothing is perfect, Naive Bayes turns out to be a really useful approach in a number of cases. With even a relatively small number of observations, we can make assumptions about the distribution of our data, and then make predictions based on the presumed distribution rather than the data itself. In some cases, when are assumptions aren’t too far off, those predictions can be valuable, and may approach reality.
Let’s go back to our wardrobe scenario. Let’s say we have a number of data points such as sock and t-shirt color, gender, age group, and music preference. We need to turn everything we know into a vector of ones and zeroes. So for socks, let’s assume we have 5 color choices: red, blue, white, black, and brown. We could code the color of socks someone purchased by giving each one of these a spot in our vector, and setting the appropriate value to 1 while leaving the rest as 0. That gives us five dimensions, one of which will be coded 1. Let’s assume we have 8 t-shirt colors, so the next 8 bits are coded as zeroes except for the bit that corresponds with the t-shirt that was purchased. We are now up to 13 dimensions. Gender could be male, female, or other, so we code that as the next three bits (16 so far). For our purposes, age group is broken down into five categories (21 bits now) and music preference into nine more. Coding our data to match then gives us a vector with 30 bits, each one indicating yes or no to a specific data point.
Now assume we have this data for 200 customers, and in addition we recorded whether or not those customers also purchased a pair of shorts within a week of making their initial purchases. Let’s assume that 25% of those 200 customers did indeed purchase shorts and 75% did not. If we take the subset of customers that made the purchase and group them together in a matrix, we can calculate the mean of all of the one’s and zeroes for a given feature (column) to come up with a probability that that particular feature indicates a yes to our shorts question if our test data point for that feature is coded as a one. Put another way, you end up with a vector with 30 values, each one being the aggregate probability of buying shorts based on a having 1 encoded in a given spot.
That’s just the beginning of the process, and there’s a fair bit more theory and some math implications that force you to use log values and such, but we now we have all the data we need to work with to build out a Naive Bayes Classifier that predicts — given a customer’s characteristics — whether or not they will purchase a pair of shorts within the next seven days.
And, let’s be honest — that’s pretty cool! If it works. Whether or not it works probably depends on the relative predictive power of our collected data. Odds are, there are other important features that we did not include — for example, time of year and weather, whether or not shorts had already been purchased, income level, sales and advertising, and other things that probably heavily influence these kinds of purchasing decisions. But it stands to reason that given enough data points and observations, you could do a reasonable job of making decently accurate predictions on a grand scale, without being bogged down by the strict rules that standard Bayes would hold you to. And you’re making these predictions relatively quickly and cheaply compared to other ML models which require a lot more data and a lot more work (not to mention a lot more compute power) before they can start providing value.
Thus, while Naive Bayes isn’t the end-all, be-all of machine learning algorithms, it has powerful capabilities and flexibility that likely mean it will be an important tool in the data scientists tool kit for a long time to come.