The Likelihood of Tails vs. “Not Heads”
When discussing probability, a commonly used example is a coin flip. It’s a 50/50 proposition, presumptively, that you will observe either heads or tails on any given flip. If you assign labels like h and t to heads and tails, you would write that out as the probability of heads = 50%, or P(h) = .5. The presumptive probability of tails would also equal 50%, or P(t) = .5 as well.
That part is pretty straightforward, in theory. In reality, very little is actually that straightforward. Even a simple coin flip is subject to all kinds of laws of physics: a slight change in the speed or strength of the “flipping thumb” will absolutely impact the likelihood of getting one result vs. another. The starting position of the coin (heads vs. tails up) matters. Things in the environment (indoors vs. outdoors, wind speed and direction, other weather, and even temperature) can have an impact. The type of coin, and its specific weight distribution, matters. Maybe you follow astrology or similar and believe the day of the week favors one result or the other. And on and on.
So if you are trying to determine the likelihood of a certain coin flip result given a set of conditions, to be as accurate as possible you’d want to track and measure everything that you think might influence the outcome. Assuming you are going to turn this into a data science problem, you need to turn all of those individual data points, or “features,” into binary values and track them individually. Our tracking sheet might look something like this:
Each of the columns in our data represents a different feature, except for the last one, which is our label, or the result. Each row is an observation. For each feature that applied to a certain observation, we would enter a 1, and for all other features we would enter a 0. For our result, we would put in a +1 for heads and a -1 for tails. Let’s assume that over the course of several weeks we made 10,000 observations for training data and collected the results of every coin flip. We now have the basis for throwing machine learning, and specifically Naive Bayes, at this problem of predicting how the next coin flip might land.
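To make that encoding concrete, here is a minimal Python sketch of how one row of the tracking sheet could be built. The column names (`coin_penny`, `start_heads_up`, `indoors`) and the two sample observations are made up for illustration; they are not the actual columns or data from the sheet.

```python
# Hypothetical feature columns, loosely based on the factors listed above.
FEATURES = ["coin_penny", "start_heads_up", "indoors"]

def encode(observation, result):
    """Build one row: a 0/1 value per feature, then the label (+1 heads, -1 tails)."""
    row = [1 if name in observation else 0 for name in FEATURES]
    label = 1 if result == "heads" else -1
    return row + [label]

# Two made-up observations: the set holds the features that applied to that flip.
rows = [
    encode({"coin_penny", "indoors"}, "heads"),
    encode({"start_heads_up"}, "tails"),
]
```

Every feature that applied to a flip becomes a 1, everything else a 0, and the last entry in each row is the label.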
Let’s craft some assumptions: In those 10,000 coin flips, we observed 6,300 heads results (recorded as +1) and 3,700 tails results (recorded as -1). Oooh, intrigue already! Not exactly as close to 50/50 as you might think. Now, let’s craft what I will refer to as a probability profile for each result. What do I mean by “probability profile”? I mean, for each feature, what percentage of the time was it true (set to 1) when a given result was observed? Again, we’re just making up numbers here, but let’s assume the probability profiles for each result looked something like this:
(For the pedants in the audience: Yes, I know that these features are very much not conditionally independent, and thus I’m stretching the credibility of the Naive Bayes assumption. On that note, if you’re looking for a blog post that tells you how to accurately predict a coin flip in the real world, you’re on the wrong page, friend.)
To get those numbers, I had to subset the 6,300 heads results from the training data, and then take the mean for each column (the % of time it was recorded as a 1 vs. a 0). I then did the same thing for the 3,700 tails observations.
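Here is a small sketch of that subset-and-average step in plain Python. The six training rows are toy data invented for the example, and `probability_profile` is a hypothetical helper name, not something from a library.

```python
def probability_profile(rows, label):
    """Average each feature column over only the rows that carry the given label."""
    subset = [r[:-1] for r in rows if r[-1] == label]
    n = len(subset)  # assumes at least one row has this label
    return [sum(col) / n for col in zip(*subset)]

# Toy data: three binary features, then the label (+1 heads, -1 tails).
training = [
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, -1],
    [0, 1, 0, -1],
]

heads_profile = probability_profile(training, 1)
tails_profile = probability_profile(training, -1)
```

Because every feature is a 0 or a 1, the mean of a column is the same thing as the share of rows where that feature was set.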
The next step is to calculate, for every observation, the likelihood ratio that the result I observed was the one I **should** have observed, given the probability profile for that result. For our purposes, the likelihood ratio is defined as the probability of an observed result divided by the probability of the inverse result. If the only data we were collecting were the heads and tails results, that would be the probability of heads / the probability of tails. So going by our observations, the likelihood ratio of heads to tails would be 63% / 37%, or approximately 1.7. Thus, given a coin flip, we would expect the result to be heads about 1.7x as often as tails.
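The arithmetic for that labels-only ratio, using the 6,300 and 3,700 counts assumed above, looks like this in Python (the variable names are just for illustration):

```python
heads_count = 6300
tails_count = 3700
total = heads_count + tails_count

p_heads = heads_count / total  # 0.63
p_tails = tails_count / total  # 0.37

# Probability of the observed result divided by the probability of its inverse.
ratio = p_heads / p_tails
print(round(ratio, 2))  # 1.7
```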
And that’s where things get dicey, if you’re not careful and observant.
The problem we run into is this: in the simple example at the start of the blog, heads and tails are our labels, **AND** since they’re the only things we are observing, they happen to be the inverse of each other. Inverse probabilities means if you add them together, you will get 100%. Or, to put it another way, P(h) + P(t) = 1.
This turns out to be a really unhelpful, and even confusing, example when you try to generalize to a more complex dataset. Let’s take a look again at our probability profile for heads in our more complex data:
Since we are making the Naive Bayes assumption, each one of our features is assumed to be independent. Each feature is also binary, so for a given result, the probability of a feature plus the inverse of that feature (needed for the likelihood ratio calculation) should equal 100%, or 1. With that in mind, let’s take a look at our tails probability profile again:
Note here that if you sum the first column (which happens to be “Type of Coin: Penny”) for heads and tails, you get .45 (.21 for heads plus .24 for tails), a far cry from 1. Therefore, we can conclude with certainty that our observations of heads are not the inverse, on a feature by feature basis, of our observations of tails. Or, put another way, the inverse of each column for heads does not equal that value for tails.
In order to compute a likelihood ratio for heads given a set of observed conditions, then, you cannot use the values for tails. Instead, you have to impute the inverse of heads.
In our first feature column, for example, the heads value for Type of Coin: Penny is .21, so its inverse is 1 minus .21 = .79, or 79%. Now, we already know that the tails value in this category is actually .24, so we can’t just substitute the tails number for the inverse of heads. Instead, we have to explicitly call out an imputed inverse of 1 minus heads for this value. And then, when we evaluate tails, we have to do the same thing again: impute the inverse. So instead of two possible values, heads and tails, you end up with four values: heads and inverse heads (or “not heads”), tails and inverse tails (or “not tails”).
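As a quick check on those four values, here is the first column worked out in Python. Only the .21 and .24 figures come from the profiles above; the variable names are mine.

```python
heads_penny = 0.21  # heads profile value for Type of Coin: Penny
tails_penny = 0.24  # tails profile value for the same column

# The imputed inverses: 1 minus the value from the same profile.
not_heads_penny = 1 - heads_penny  # 0.79
not_tails_penny = 1 - tails_penny  # 0.76

# The heads and tails values do not sum to 1, so neither can stand in
# for the other's inverse.
column_sum = heads_penny + tails_penny  # 0.45
```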
Then, when you get to the actual prediction part, you’re not saying it’s definitely one or the other. You’re evaluating which result is most likely given the observed features. For each possible outcome, you combine the probabilities of the observations under that outcome’s profile (using the chain rule), and then you pick whichever outcome comes out as more likely.
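A rough sketch of that comparison is below. It is one common way to write the Naive Bayes step, not necessarily the exact calculation behind this post: it multiplies the overall share of each result by the per-feature probabilities (working in logs to avoid tiny numbers). Apart from the .63/.37 split and the .21/.24 penny values, every number here is made up, and `score` is a hypothetical helper.

```python
import math

def score(prior, profile, observation):
    """Log of: prior times, per feature, p if the feature is 1 or (1 - p) if it is 0."""
    total = math.log(prior)
    for p, x in zip(profile, observation):
        total += math.log(p if x == 1 else 1 - p)
    return total

priors = {"heads": 0.63, "tails": 0.37}
profiles = {
    "heads": [0.21, 0.60, 0.50],  # first value from the post, the rest invented
    "tails": [0.24, 0.30, 0.50],
}

observation = [1, 1, 0]  # which features applied to the flip being predicted

scores = {k: score(priors[k], profiles[k], observation) for k in priors}
prediction = max(scores, key=scores.get)
```

Note that each result is scored against its own profile and its own imputed inverse (the `1 - p` branch); the tails profile is never used as the inverse of heads. A profile value of exactly 0 or 1 would break the logarithm, so real data usually needs some smoothing first.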
This was a tricky concept for me, because I kept going back to “if it’s not heads, it’s tails” and my math wouldn’t come out correctly for what are now obvious reasons. Once I figured out that what I needed to use was the inverse of heads, and that was not equal to tails, then the rest fell into place relatively easily. Hopefully, if you’re struggling with the same concepts, this helps you as well.