Tokenization Level 2 (or so…)

Jason Eden
4 min read · Sep 13, 2022

Demystifying the Magic Behind Large Language Models

When I first started reading and watching people in the field talk about complex encoding and decoding systems for NLP (including transformers and various other neural network architectures), if I’m being honest, it sounded (still sounds?) like magic. If I were to take some of the things they say and restate them the way my brain translated them, it would go something like this:

“The data get passed through the first layer, and then a second layer, but the first layer is added into the second layer, which encodes meaning into the overall phrase being evaluated. This gets replicated across multiple layers, and voila! Add a bunch of vectors together and you get a mathematical construct that understands the meaning of the text being fed into it.”
— Jason Eden’s very early understanding of NLP

The above summarization is, of course, incorrect for most large language models. However, it wasn’t until I went through the Hugging Face course on the transformers package that I finally saw an example of what was going on, and everything fell into place (I think). I highly recommend checking it out if this is an area of interest for you and, like me, you hear about these mystical-sounding dark arts of neural networks and wonder how they actually work.
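
As a quick taste of what that course walks through, here’s a minimal sketch using the transformers package. (I’m picking bert-base-uncased as an example checkpoint; the course works through several models, so treat this as an illustration rather than the course’s exact code.)

```python
# A quick look at what a pre-trained tokenizer actually produces.
# Requires: pip install transformers
from transformers import AutoTokenizer

# "bert-base-uncased" is just one example checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The bright light hurt her eyes."
tokens = tokenizer.tokenize(text)              # the text split into tokens (strings)
ids = tokenizer.convert_tokens_to_ids(tokens)  # each token mapped to a unique number

print(tokens)  # e.g. ['the', 'bright', 'light', 'hurt', 'her', 'eyes', '.']
print(ids)     # the integer IDs the model actually sees
```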

(Full disclosure: this is me learning out loud, so if you’re an expert in this and see errors in how I’m describing this, please let me know!)

Exploding “Vocabulary”

If you’ll recall from my first blog post on tokenization, a token can be defined in any number of ways (a lot more than I listed in that post, actually). One of those ways is the “ngram,” where n is the number of words that are treated as a single token. And here’s some magic: the same word (or part of a word) in a sentence can be represented by multiple tokens in the same machine learning model. Remember that the token itself is simply a unique vector representation, which can be mapped to a simple number. Thus, you may have some total number of unique words, but a vocabulary of tokens that is many multiples of that size.

For example, let’s examine the word “bright” in these sentences:

  • The spotlight shone on her, the bright light hurting her eyes.
  • He is not bright when it comes to math.
  • With cotton clothing, bright colors fade in the wash.

Each sentence uses “bright” to convey a very different meaning. When we tokenize this corpus, we might use ngram sizes ranging from one to five. For the first pass, n=1 tokenization, the word “bright” would get the same value in each sentence, because context is irrelevant for that token. However, the other ngrams that contain the word “bright” would each have different values. We might have n=3 ngrams that look like this (with some text processing to remove some words and punctuation…):

  • her bright light
  • not bright when
  • clothing bright colors

…and for n=5:

  • on her bright light hurting
  • is not bright when it
  • cotton clothing bright colors fade
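
To make that concrete, here’s a rough sketch of the extraction step in Python. I’ve skipped the stop-word filtering that the lists above assume, so the exact grams come out slightly differently, but the idea is the same: every unique ngram, at every size, gets its own ID in the vocabulary.

```python
# Rough sketch: build a vocabulary of ngram tokens (n = 1 through 5) from the
# three example sentences, and show how many tokens contain the word "bright".
import re

sentences = [
    "The spotlight shone on her, the bright light hurting her eyes.",
    "He is not bright when it comes to math.",
    "With cotton clothing, bright colors fade in the wash.",
]

def ngrams(words, n):
    """Return every run of n consecutive words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

vocab = {}  # maps each unique token (of any ngram size) to a unique number

for sentence in sentences:
    words = re.findall(r"[a-z]+", sentence.lower())
    for n in range(1, 6):                       # ngram sizes one through five
        for gram in ngrams(words, n):
            vocab.setdefault(gram, len(vocab))  # assign the next free ID

# The single word "bright" is one token...
print("bright ->", vocab["bright"])

# ...but it also lives inside many larger tokens, each with its own ID:
for gram, token_id in vocab.items():
    if len(gram.split()) == 3 and "bright" in gram.split():
        print(token_id, gram)

# And the vocabulary ends up many multiples of the unique word count:
unique_words = {w for s in sentences for w in re.findall(r"[a-z]+", s.lower())}
print(len(unique_words), "unique words ->", len(vocab), "tokens")
```

On these three sentences, a couple dozen unique words turn into well over a hundred distinct tokens, which is the “exploding vocabulary” in action.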

Each one of these ngrams would have a vector value as well, again represented by a unique number, and these more complex tokens actually carry some contextual meaning when it comes to prediction. For example, if I asked a model to complete the phrase “When I wash certain clothing, bright”, it has the context to predict that the next word could be “colors”, and then follow that up with the word “fade”, because it has a set of tokens that closely matches the input.
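
That “set of tokens that closely matches the input” step can be faked with nothing fancier than counting. Here’s a toy sketch of a count-based next-word lookup. The tiny training text is made up for illustration, and this is emphatically not how a transformer works internally, but it shows why matching longer ngrams gives the model useful context:

```python
# Toy next-word predictor: count which word follows each two-word prefix in a
# tiny made-up corpus, then return the most common continuation.
import re
from collections import Counter, defaultdict

training_text = (
    "With cotton clothing, bright colors fade in the wash. "
    "When I wash certain clothing, bright colors fade quickly."
)
words = re.findall(r"[a-z]+", training_text.lower())

n = 3  # three-word ngrams: two words of context predict the third
counts = defaultdict(Counter)
for i in range(len(words) - n + 1):
    prefix = tuple(words[i:i + n - 1])
    counts[prefix][words[i + n - 1]] += 1

def predict(prompt):
    """Most frequent word seen after the last two words of the prompt."""
    prefix = tuple(re.findall(r"[a-z]+", prompt.lower())[-(n - 1):])
    best = counts.get(prefix)
    return best.most_common(1)[0][0] if best else None

print(predict("When I wash certain clothing, bright"))         # -> colors
print(predict("When I wash certain clothing, bright colors"))  # -> fade
```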

In the end, then, when you see a pre-trained machine learning model described as having billions and billions of parameters, what that (usually) means is that the builders took a vocabulary of unique words much smaller than that and trained the model to recognize not only the words themselves, but also their typical placement in sentences and the words that usually precede and/or follow them. The more of these ngrams you can compute, the more examples your model has, and thus the better it gets at predicting the correct way to respond to a prompt.

I’m no James Randi, but…

(If you don’t know who James Randi was, he’s worth looking up.)

I’m vastly, vastly oversimplifying this, but the key point is that it’s not magic. When you build these large language models, the model is basically memorizing longer and longer phrases and leaning on lots of examples of those phrases (plus some fancy math and data processing, which we may talk about later), so that it gets good at predicting the next word in a sentence simply because it has seen lots of examples of something almost exactly like what you have written.

There’s a lot more to tokenization, which is a critical part of what makes NLP machine learning models so powerful today. However, at the end of the day, once you understand how it really works, it’s not magic. It’s just a lot — and I do mean a lot — of number crunching. For me, that makes the whole thing seem a lot more approachable. Hopefully it does the same for you.


Jason Eden

Data Science & Cloud nerd with a passion for making complex topics easier to understand. All writings and associated errors are my own doing, not work-related.