Defining Dueling Distributions
Binomial and Normal Distributions in Naive Bayes
For a simple Naive Bayes classifier, we deal with two types of distributions for data and predictions: binomial and normal.
Binomial Distributions: Pick a Side
A binomial distribution, for our purposes, is one in which every possible value is either 0 or 1. (Strictly speaking, this is a Bernoulli distribution, which is a binomial distribution with a single trial, but "binomial" is the term we'll use here.) Another way to think about this is “inactive” vs. “active,” 0% vs. 100%, or “False” vs. “True.” In any case, binomial features in data take only the two values 0 and 1.
This becomes important when building certain types of feature vectors / embeddings. You might be taking unstructured data (such as letters or words) and converting it into a feature vector so that you can perform math on it. Each feature, or column, in your data represents the presence or absence of a certain idea (for example, “the first character is a vowel”). That statement is either true or false, so the value is going to be 1 for true or 0 for false. There is no such thing as “kinda,” at least not in the English language, so there is no logical reason for a value to be anything other than 1 or 0. The feature vector in this case is the collection of these columns, each checking whatever you are evaluating for truth vs. falsehood, or active vs. inactive, and coding it as 1 or 0 accordingly.
For example, say I define a vector / embedding scheme with four features: first character is a vowel, last character is a vowel, more than five characters in the word, and more vowels than consonants. If I take the word “language” and apply this logic, I convert it into the four-dimensional feature vector 0 1 1 0.
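Here is a minimal sketch of that featurization in Python. The function name featurize and the exact vowel-checking rules are my own illustrative choices, not a standard API.

```python
# A sketch of the four-feature binary embedding scheme described above.
VOWELS = set("aeiou")

def featurize(word: str) -> list[int]:
    """Convert a word into a 4-dimensional binary (0/1) feature vector."""
    word = word.lower()
    vowel_count = sum(ch in VOWELS for ch in word)
    consonant_count = len(word) - vowel_count  # treats any non-vowel letter as a consonant
    return [
        int(word[0] in VOWELS),              # first character is a vowel
        int(word[-1] in VOWELS),             # last character is a vowel
        int(len(word) > 5),                  # more than five characters
        int(vowel_count > consonant_count),  # more vowels than consonants
    ]

print(featurize("language"))  # [0, 1, 1, 0]
```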
Normal Distributions: Freedom and Flexibility
In contrast to binomial distributions, normal distributions describe features that can take on a continuous range of values, in our case anywhere from 0 to 1. We make assumptions about how those features are distributed via the famous bell curve (in which the mean, median, and mode are all the same value), but the actual value of a given feature can be literally anything between 0% and 100%.
Combining the Distributions for Naive Bayes
We end up using both of these concepts in Naive Bayes. Our training data contains multiple observations of something, which we convert into binomial feature vectors; we then separate those vectors by class and average them into normal probability vectors, one per class, which we can use to make predictions.
Using Distributions to Make Predictions with Naive Bayes
Say for example we have four keywords in a document that is classified as “academic” for our purposes: language, speak, analog, and alibaba. This creates a data matrix for our academic class of:
language: 0 1 1 0
speak: 0 0 0 0
analog: 1 0 1 0
alibaba: 1 1 1 1
Because we are assuming a normal distribution, we can calculate the average values for each of our features to come up with a probability vector which we can use for Naive Bayes predictions:
.5 .5 .75 .25
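That averaging is easy to do by hand here, but as a minimal sketch, the same calculation in plain Python looks like this (the variable names are my own):

```python
# The "academic" data matrix from above, one row per word.
academic = [
    [0, 1, 1, 0],  # language
    [0, 0, 0, 0],  # speak
    [1, 0, 1, 0],  # analog
    [1, 1, 1, 1],  # alibaba
]

# Average each feature (column) across the observations in the class.
academic_probs = [sum(col) / len(academic) for col in zip(*academic)]
print(academic_probs)  # [0.5, 0.5, 0.75, 0.25]
```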
If I have a new word, clown, and I need to determine the probability that it belongs in the academic category, I first convert it with the same scheme; clown becomes 0 0 0 0. I can then use the chain rule and Bayes' Theorem to calculate the probability that this word belongs in that class. (I won't walk through the arithmetic just yet, but the answer is about 4.7%.)
In order to check if this is the right prediction to make, I need another class to compare it to. Let's assume I have other words that fit into a category of “fun,” and that after creating their embeddings and calculating the average feature values, they produce the following probability vector:
.2 .3 .4 .6
(Obviously this training data had a different number of observations than our first class. There’s no rule against that!)
We can now again use the chain rule and Bayes' Theorem to come up with an aggregate probability; this time it's around 13.4%. (Both calculations are sketched in the code below.)
Now I can make my prediction: since a 13.4% probability beats a 4.7% probability, if I am choosing between those two classes, I predict that clown belongs in the “fun” class. (With apologies to any of my friends who took a challenging academic path in clown school…)
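Here is a minimal sketch of both calculations in Python. It assumes equal priors for the two classes, so comparing these naive likelihoods is equivalent to comparing the full Bayes posteriors; the helper name class_likelihood is my own.

```python
def class_likelihood(features: list[int], probs: list[float]) -> float:
    """P(features | class) under the naive independence assumption:
    multiply p for each feature that is 1 and (1 - p) for each that is 0."""
    likelihood = 1.0
    for x, p in zip(features, probs):
        likelihood *= p if x == 1 else (1 - p)
    return likelihood

clown = [0, 0, 0, 0]                     # "clown" under the scheme above
academic_probs = [0.5, 0.5, 0.75, 0.25]  # averaged "academic" vector
fun_probs = [0.2, 0.3, 0.4, 0.6]         # averaged "fun" vector

print(class_likelihood(clown, academic_probs))  # 0.046875  (~4.7%)
print(class_likelihood(clown, fun_probs))       # 0.1344    (~13.4%)
```

Real implementations typically sum log-probabilities instead of multiplying raw probabilities, so that long feature vectors don't underflow to zero.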
Practical Application
This simple example actually calls out a number of core tenets of data science. If you know what signals are valuable in a certain type of data (my four-feature example probably wouldn't be great in real life, but might not be a bad start for a higher-dimensional vector / embedding scheme), then you can create the embeddings using rules that extract those meanings via binomial distributions. You can then take two or more classes of data, convert each relevant observation into an embedding, and average each feature within each class to come up with a normal distribution probability vector per class. That probability vector, plus some basic statistics and the naive assumption that lets us use the chain rule to calculate aggregate probabilities, can then be applied to an unknown observation to find the class it most likely belongs to, which becomes our prediction in a machine learning model.
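To tie the pieces together, here is a consolidated sketch of that pipeline under the same assumptions as the earlier snippets. The fit and predict function names are my own, and the word list for the “fun” class is made up purely for illustration, since only its averaged vector appears above.

```python
# End-to-end sketch: binary featurization, per-class mean vectors,
# and a naive-likelihood prediction.
VOWELS = set("aeiou")

def featurize(word: str) -> list[int]:
    word = word.lower()
    vowels = sum(ch in VOWELS for ch in word)
    return [
        int(word[0] in VOWELS),
        int(word[-1] in VOWELS),
        int(len(word) > 5),
        int(vowels > len(word) - vowels),
    ]

def fit(words_by_class: dict[str, list[str]]) -> dict[str, list[float]]:
    """Average the binary feature vectors within each class."""
    model = {}
    for label, words in words_by_class.items():
        matrix = [featurize(w) for w in words]
        model[label] = [sum(col) / len(matrix) for col in zip(*matrix)]
    return model

def predict(word: str, model: dict[str, list[float]]) -> str:
    """Pick the class whose naive likelihood for the word's features is highest."""
    x = featurize(word)

    def likelihood(probs: list[float]) -> float:
        result = 1.0
        for xi, p in zip(x, probs):
            result *= p if xi else (1 - p)
        return result

    return max(model, key=lambda label: likelihood(model[label]))

model = fit({
    "academic": ["language", "speak", "analog", "alibaba"],
    "fun": ["party", "clowning", "games"],  # made-up examples for this sketch
})
print(model["academic"])        # [0.5, 0.5, 0.75, 0.25]
print(predict("clown", model))  # fun
```

A production implementation would also add Laplace smoothing so that no feature probability ends up exactly 0 or 1, and would work in log space; libraries like scikit-learn's BernoulliNB handle both for you.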
You will see this basic principle applied all over machine learning today, well beyond the simple Naive Bayes example illustrated here, and understanding how this works will help you wrap your arms around much more complex topics in data science.