# The Treat of the Trick

Practical Exploration of “The Kernel Trick”

In my blog post on feature expansion, we discussed how you could manipulate existing features and create new ones, and in the process take data that were not linearly separable and convert them into higher-dimensional forms that might be. I showed a simple example where a two-dimensional data set with a single new dimension (sum of squares of both values) produced the desired result. Easy enough.

But real life is never going to be that simple, is it?

Let’s start with a couple pieces of context. I used sum of square values for my previous example, which is fine and a viable example of feature expansion, but not one that really gets used a lot. It’s far more common to take the product of two features and treat that as the interaction — in part because technology like TPUs and GPUs can do matrix multiplication super quickly, at scale. In addition, we don’t really know what combinations of features are going to give us the most signal in terms of classification and separability, and rather than trying to guess, from an algorithmic standpoint you’d simply want to try every possible combination. And finally, you add one more dimension that was computationally neutral (so for multiplication purposes, a 1) to each vector. We would indicate that this operation has been performed on our vector using the symbol/term “phi of <vector id>.”

The implication here is that our two-dimensional vector (let’s call it A) would be transformed from this:

A = (x, y)

…to this:

phi_A = (1, x, y, x*y)

So a two-dimensional vector transforms into a four-dimensional vector. What happens when we add a third dimension?

A = (x, y, z)

phi_A = (1, x, y, z, x*y, x*z, y*z, x*y*z)

From three to eight dimensions. One more — a four-dimensional vector transforms like this:

A = (x, y, z, q)

phi_A =

(1, x, y, z, q, x*y, x*z, x*q, y*z, y*q, z*q, x*y*z, x*y*q, x*z*q, y*z*q, x*y*z*q)

From four to 16 dimensions. The total number of dimensions generated from an initial starting vector will be 2 raised to the power of the number of original dimensions. Put another way, every new dimension will double the size of the phi equivalent.

Where this gets really hairy is when you’re trying to build the classifier, and specifically a maximum margin classifier. As part of this process, you need to mathematically determine the distance between two vectors (and repeat for every possible combination of vectors) which can be accomplished for Euclidean distance using matrix multiplication inner products. Let’s start small and create the phi equivalents of a couple of two-dimensional vectors to illustrate the complexity here (changing feature notation for clarity):

phi_A = (1, a1, a2, a1a2)

phi_B = (1, b1, b2, b1b2)

To compute the inner product of (or Euclidean distance between) these two vectors, the math looks like this:

Distance_AB = 1+a1b1+a2b2+a1a2b1b2

Not so bad, right? Let’s take it up a level and assume we have three features:

phi_A = (1, a1, a2, a3, a1a2, a1a3, a2a3, a1a2a3)

phi_B = (1, b1, b2, b3, b1b2, b1b3, b2b3, b1b2b3)

And now our formula looks like this:

Distance_AB = 1+a1b1+a2b2+a3b3+a1a2b1b2+a1a3b1b3+a2a3b2b3+a1a2a3b1b2b3

The number of features in our phi vectors determines the number of addition problems our computer needs to do in order to compute the distance. But isn’t that what computers are for and what they’re good at? Simple math shouldn’t be a problem, should it?

Enter our old friend, the curse of dimensionality.

If you’re just dealing with a small number of features, this is indeed not a problem. However, if you have just 20 features, this results in a distance calculation that requires more than one million additions (remember, 2 to the power of the number of dimensions…) Just 40 features puts this number over one **trillion** addition problems. One more feature adds yet another trillion. The next one two trillion. And on it goes.

You’re gonna to need a bigger computer.

## The Kernel Trick to the Rescue

The question then becomes “is there another way to compute relative distances between vectors without doing all of the longhand addition?” And it turns out there is a way. In fact, there are lots of them.

From a terminology perspective, we call the matrix multiplication method we used above a “kernel.” The trick, then, is to substitute the approach with another one that gives us either the exact same answer with fewer computations required, or a set of answers that is at least somewhat representative of what the original kernel would have given us, but perhaps in a different format or scale.

Let’s start with an approach that gives us the same answer but with fewer computations required. Take another look at our distance calculation in two dimensions:

1+a1b1+a2b2+a1a2b1b2

It turns out we can simplify this by turning our four addition problems into two multiplication problems:

(1+a1b1)(1+a2b2)

This “trick” is called the polynomial kernel. Let’s call the number of our dimensions “*d*.” The polynomial kernel takes the number of calculations required from 2 raised to the power *d* addition problems down to just *d* multiplication problems. Our three-dimensional vector distance:

1+a1b1+a2b2+a3b3+a1a2b1b2+a1a3b1b3+a2a3b2b3+a1a2a3b1b2b3

…simplifies to:

(1+a1b1)(1+a2b2)(1+a3b3)

When you get into higher dimensions is where the power of this swap out becomes even more evident. For 40 dimensions, the longhand inner product calculation was going to require more than a trillion additions. Using the polynomial kernel, we can arrive at exactly the same answer using only 40 multiplications, which will run in a very small fraction of the time.

In addition to the polynomial kernel, another popular approach is the Radial Basis Function or RBF kernel. This formula is not a direct correlation to inner product like the polynomial kernel is, but uses a more sophisticated statistical approach which provides similar information. Depending on the nature of your data (and especially in higher dimensions) it can produce even better/faster results than the polynomial kernel. You can also create your own kernel, as long as it follows a certain set of rules (which go mathematically further than what I typically go in my writing… let me know if you’re dying to dig into it.)

The net, though, is that using the kernel trick allows us to replace the traditional, longhand method of computing inner products (and thus Euclidean distances) with another approach that provides either the exact same information or statistically similar information (for classification purposes) using a fraction of the compute power and time. And in the world of machine learning, that’s most definitely a treat.