Natural Language Processing — Defining Tokens
I have what is now an embarrassing confession to make. Before I started deeply exploring the math, statistics, and theory behind data science, I had a notion that predictive analytics (i.e., taking tabular data and using it to predict the future) was the hardest thing to do in data science. My innocent, unaware mind looked at Natural Language Processing (NLP) as the easy stuff you did early in your career, before you had built the experience to really tackle the hard things like predicting future housing prices and supply chain issues.
Fortunately, I never gave voice to this idea in a public forum. If I had, I’m assuming the reaction of professionals in the field might have gone something like this:
Don’t get me wrong: doing complex predictions, and doing them well, is also very hard. But I was completely underestimating the complexity and the work that goes into teaching a machine how to extract meaning from unstructured data. Waaaaay wrong. So in this post, I’m going to start sharing some of the things I have been learning recently, and giving appropriate props to all of the NLP experts out there who, if I had made my ignorance known, would have laughed me out of the room (and rightfully so…).
Data Engineering Hits Differently
When it comes to building predictive models, the vast majority of the job, if you’re doing it correctly, is getting your data into the right shape for the modeling approach of choice to work, not the actual model building itself (usually…). With tabular data, this usually entails figuring out what to do with missing data, making sure you have the right data for the thing you’re trying to predict, centering and scaling, and so forth. Getting good data is really, really hard, but the rules are generally consistent and decently well-understood.
With NLP? Oh. My. Freaking. Goodness. It’s the wild, wild west out there.
The Fellowship of the Things, by J. R. R. Token
The first step in NLP (and I’m going to caveat by saying I’m still early in my journey here…) is taking some text of some kind and breaking it down into an abstraction known as tokens, so that you can use these abstractions to perform machine learning.
What’s a token, you ask? Great question. You see, it depends.
- A token could be a word, meaning every word in a corpus (collection) of documents could be a token, and this sentence would be 26 tokens.
- A token could be a sentence, meaning this sentence would be one token.
- A token could be a paragraph. This would mean blocks of text separated by a return character would represent a single token. This paragraph would be one token.
- A token could be a collection of some number of n words (two word “bigrams”, three word “trigrams” and so forth…), with each word away from the edges of the text showing up in n tokens as the beginning, end, or some middle part of a token. How many bigrams do you see in this paragraph?
- A token could be a part of a sentence — any collection of words separated by any type of sentence punctuation, meaning this sentence would represent three tokens.
- A token could be a type of designation, such as title text, header text, subheader text, and content. The entire list of token options would represent one token.
If it’s not clear, this is by far not a complete list of options when it comes to tokenization.
BY. FAR. NOT. A. COMPLETE. LIST.
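To make a few of those options concrete, here is a minimal sketch using nothing but Python’s built-in `re` module. The function names and the regular expressions are my own illustrative assumptions, not a standard API, and each one cuts corners (hyphenated words get split in two, and abbreviations like “Dr.” would confuse the sentence splitter). Real tokenizers handle far more cases.

```python
import re

def word_tokens(text):
    # Each run of letters, digits, or apostrophes counts as one word token.
    return re.findall(r"[A-Za-z0-9'\u2019]+", text)

def sentence_tokens(text):
    # Naive rule: a sentence ends at ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def paragraph_tokens(text):
    # One token per block of text separated by return characters.
    return [p.strip() for p in re.split(r"\n+", text) if p.strip()]

def ngram_tokens(words, n):
    # Slide a window of n consecutive words across the word list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def clause_tokens(text):
    # Split on sentence punctuation: commas, semicolons, colons,
    # end marks, and the em dash (written here as the escape \u2014).
    parts = re.split(r"[,;:.!?\u2014]+", text)
    return [p.strip() for p in parts if p.strip()]

word_bullet = ("A token could be a word, meaning every word in a corpus "
               "(collection) of documents could be a token, and this "
               "sentence would be 26 tokens.")
clause_bullet = ("A token could be a part of a sentence \u2014 any collection "
                 "of words separated by any type of sentence punctuation, "
                 "meaning this sentence would represent three tokens.")

words = word_tokens(word_bullet)
print(len(words))                         # 26 word tokens
print(len(ngram_tokens(words, 2)))        # 25 bigrams
print(len(clause_tokens(clause_bullet)))  # 3 clause tokens
```

Same text, different rules, very different sets of tokens. A sentence with 26 words produces 25 bigrams because every window of two neighboring words is its own token.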
So an early (the first?) step in NLP is figuring out, based on the type of problem you’re trying to solve, how to take a corpus of documents and divide it up into the pieces/inputs that will feed the machine learning model.
We haven’t even gotten into things like deciding whether to remove stop words (a, and, the, etc.: words that don’t carry much signal… usually… and the caveats here are killer…), whether to use stemming/lemmatization (breaking words down to their root, so “running” becomes “run”, “played” becomes “play”, and so on) and how to do that (so that you avoid words like “coed” becoming “co”), whether to keep things like token order and count or just look at the collection of unique tokens, what to do with punctuation, etc. etc. etc. etc.
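Here is a rough sketch of a few of those decisions, again in plain Python. Everything in it is an assumption made for illustration: the stop word list is tiny and hand-picked, and `naive_stem` is a toy suffix stripper I made up to show the problem, not the algorithm any real stemming library uses.

```python
from collections import Counter

# A tiny, hand-picked stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

def remove_stop_words(tokens):
    # Drop any token whose lowercase form is in the stop word list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def naive_stem(word, min_stem_len=1):
    # Toy stemmer: strip "ing" or "ed", then undo a doubled final
    # consonant ("runn" -> "run"), leaving ll/ss/zz alone ("fall").
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if (len(stem) >= 2 and stem[-1] == stem[-2]
                    and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            # Refuse to return a stem shorter than min_stem_len.
            return stem if len(stem) >= min_stem_len else word
    return word

print(naive_stem("running"))               # run
print(naive_stem("played"))                # play
print(naive_stem("coed"))                  # co   (oops)
print(naive_stem("coed", min_stem_len=3))  # coed (guarded)

print(remove_stop_words(["The", "cat", "and", "the", "hat"]))  # ['cat', 'hat']

# Order and count versus just the collection of tokens:
first = ["dog", "bit", "man"]
second = ["man", "bit", "dog"]
print(first == second)                    # False: order is kept
print(Counter(first) == Counter(second))  # True: counts only, order is gone
print(set(["the", "dog", "the"]))         # unique tokens, counts are gone too
```

The minimum stem length guard rescues “coed”, but it is a blunt instrument, and every guard like it is one more choice that changes what the model eventually sees.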
It’s enough to push a guy over the line, and we’re not even really getting started yet!
There are a variety of tools and packages out there designed to help ease some of the technical bits for application, and then there’s always regex for the corner cases. But the overwhelming truth of the matter is that just getting started with NLP likely means taking the exact same data and creating many different representations of that data in order to determine which representation produces the most signal — not for predicting the future, necessarily, but as a starting point, just for giving the machine learning algorithm some idea about what it is dealing with in the first place.
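As a small illustration of “same data, many representations”, the sketch below runs one piece of text through every combination of three choices: lowercase or not, drop stop words or not, and single words versus bigrams. The `represent` and `all_representations` helpers, the stop word list, and the sample text are all hypothetical. Note that it only reports how many tokens and unique tokens each option yields. Measuring which one actually carries the most signal would take a downstream model, which is not shown here.

```python
import re
from itertools import product

STOP_WORDS = {"a", "and", "the"}  # illustrative only

def represent(text, lowercase, drop_stop_words, n):
    # Build one representation of the text from three on/off choices.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if lowercase:
        words = [w.lower() for w in words]
    if drop_stop_words:
        words = [w for w in words if w.lower() not in STOP_WORDS]
    # n-grams here run straight across sentence boundaries (a simplification).
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def all_representations(text):
    # Try every combination of the three choices: 2 * 2 * 2 = 8 versions.
    reps = {}
    for lowercase, drop_stop, n in product((False, True), (False, True), (1, 2)):
        reps[(lowercase, drop_stop, n)] = represent(text, lowercase, drop_stop, n)
    return reps

reps = all_representations("The cat and the hat. The Cat sat.")
for (lowercase, drop_stop, n), tokens in reps.items():
    print(f"lowercase={lowercase} drop_stop={drop_stop} n={n}: "
          f"{len(tokens)} tokens, {len(set(tokens))} unique")
```

Two sentences and three yes/no style choices already give eight different inputs. Add the other decisions from this post and the number of candidates grows quickly.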
Saddle up, folks. This journey into data science is just getting started.