Giving Structure to Unstructured Data
The vast majority of the machine learning most companies practice today is performed on tabular data, i.e. data that fits into nice columns and rows like a spreadsheet, where each column means the same thing in every row (it follows a schema, whether that schema is enforced programmatically or not). We refer to this as “structured” data. Conversely, the vast majority of data that exists today takes the form of freeform documents, images, video, and other formats. We call this “unstructured” data. And while learning how to use structured data in machine learning has already had a dramatic impact on our world, it is cracking the puzzle of how best to use unstructured data that will fundamentally change the way we live. Thus, being able to provide structure to unstructured data and then do something useful with it is a significant focus for a lot of AI-focused organizations.
It turns out that actually converting unstructured data into a structured form that a computer can readily use is relatively straightforward. Take, for example, a typed word: dog. We know when dealing with unstructured text that we are going to be given one word at a time, and each word is going to be broken down into some number of characters. We also know that there are a finite number of possible characters (26 letters in the English alphabet, if memory serves correctly, in both upper and lower case, plus numbers and symbols such as a dash or apostrophe that might be embedded), that there are reasonable limits to the number of characters a single word can contain, that words will be separated by spaces, and so on. Using these parameters, it is possible to create a single one-hot encoded array that can represent virtually every reasonably useful word in the English language.
“One-hot” encoding basically means you have a category of something (in this case letters and symbols) and you simply create a series of columns (which are also referred to as “dimensions”), one per possible value. If the thing being evaluated equals a given column, you code that column as 1 and leave all other columns as 0. In our example, we could create a dimension for every possible character that could appear as the first character of our word. You’d have 26 columns for upper-case letters, 26 columns for lower-case letters, 10 columns for numerical digits, and let’s just assume 38 possible symbols so that we get to an even 100 possible things that might appear in that position.
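To make the idea concrete, here is a minimal Python sketch of one-hot encoding a single character into 100 columns. The column ordering (upper case, lower case, digits, then symbols) and the names `CHARSET` and `one_hot_char` are assumptions made for illustration. Python's `string.punctuation` only supplies 32 symbols, so the last 6 of the 38 assumed symbol columns are simply left unused here.

```python
import string

# Assumed column order: 26 upper-case, 26 lower-case, 10 digits, then symbols.
# string.punctuation has 32 symbols; the remaining 6 symbol columns stay
# unassigned (always 0) in this sketch.
CHARSET = (string.ascii_uppercase + string.ascii_lowercase
           + string.digits + string.punctuation)
NUM_COLUMNS = 100

def one_hot_char(ch):
    """Return a list of 100 zeros with a single 1 in the column for ch."""
    vec = [0] * NUM_COLUMNS
    vec[CHARSET.index(ch)] = 1  # raises ValueError for unsupported characters
    return vec

d = one_hot_char("d")
print(len(d), sum(d), d.index(1))  # 100 1 29
```

With this ordering, lower-case d lands in column 29 (26 upper-case columns come first, then a, b, c, d), and every other column is 0.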
So we have 100 possibilities for the first character, but we need to be able to represent the entire word, and do so in a structured framework. Today, the longest word found in major English dictionaries is 45 characters long. (The absolute longest word, if you’re curious, has almost 190,000 characters…) To set up our system to be able to accurately represent every reasonable word, we’d need to allocate enough character positions to cover the longest word we might run across. To be safe, let’s assume we’re going to account for up to 50 characters.
So then, to turn our unstructured data (a collection of words in a document) into a structured format that follows an enforced schema, we would convert the word dog into a 5,000-element series of ones and zeroes: 50 character positions times 100 columns each. The first set of 100 would have a 1 in the lower-case d column, the second set of 100 would have a 1 in the lower-case o column, and the third set of 100 would have a 1 in the lower-case g column. The rest of the array would be nothing but 0’s.
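The word-level encoding described above can be sketched in a few lines of Python. This is only an illustration under assumed details: the column ordering within each block of 100 (upper case, lower case, digits, then symbols) and the names `CHARSET` and `encode_word` are not from any particular library.

```python
import string

# Assumed ordering within each block of 100 columns; string.punctuation
# provides 32 symbols, so 6 columns per block are left unused.
CHARSET = (string.ascii_uppercase + string.ascii_lowercase
           + string.digits + string.punctuation)
COLUMNS_PER_CHAR = 100
MAX_WORD_LENGTH = 50

def encode_word(word):
    """One-hot encode a word into a flat list of 50 * 100 = 5,000 values."""
    if len(word) > MAX_WORD_LENGTH:
        raise ValueError("word is longer than the 50 positions we allocated")
    vec = [0] * (COLUMNS_PER_CHAR * MAX_WORD_LENGTH)
    for position, ch in enumerate(word):
        # Each character position owns its own block of 100 columns.
        vec[position * COLUMNS_PER_CHAR + CHARSET.index(ch)] = 1
    return vec

dog = encode_word("dog")
hot_columns = [i for i, value in enumerate(dog) if value == 1]
print(len(dog), hot_columns)  # 5000 [29, 140, 232]
```

Only three of the 5,000 values are 1: one in each of the first three blocks of 100. Everything past the third block is padding, which is the price of giving every word the same shape.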
BOOM! We’ve just converted unstructured data into structured data, where every word in the English dictionary in any sentence in any document will be represented in an identical fashion. Problem solved, right?
Well, while this is great and all, it turns out that actually making use of this array that represents a given word is computationally challenging. If you’ll recall the blog post I wrote about the k-Nearest-Neighbors algorithm, I mentioned that the computer “simply” computes the distance from a given point to all of the other points in your data set and finds the nearest match(es). The problem in our scenario is that, depending on how you count, there are something in the neighborhood of 200,000 words in the English language. For the simplistic k-NN approach, you’d have to compute a mathematical distance from a data point to every single one of those 200,000 words, and your data point and every one of those words would be represented by a 5,000-element array, meaning that for every word in your document you would be performing around a billion multiplication steps (200,000 words times 5,000 elements). If you’re on a decently standard computer, that should take you about one second. Not bad! If that were the whole story…
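Here is a rough Python sketch of that brute-force lookup on a toy vocabulary, followed by the step count for the full 200,000-word case. The squared Euclidean distance, the four-word vocabulary, and the helper names are all assumptions chosen for illustration, and a real k-NN implementation would be far more optimized.

```python
import string

# Assumed character ordering; 6 of the 100 columns per position go unused.
CHARSET = (string.ascii_uppercase + string.ascii_lowercase
           + string.digits + string.punctuation)
COLUMNS_PER_CHAR = 100
MAX_WORD_LENGTH = 50

def encode_word(word):
    """One-hot encode a word into a flat list of 5,000 values."""
    vec = [0] * (COLUMNS_PER_CHAR * MAX_WORD_LENGTH)
    for position, ch in enumerate(word):
        vec[position * COLUMNS_PER_CHAR + CHARSET.index(ch)] = 1
    return vec

def squared_distance(a, b):
    """Squared Euclidean distance: one subtract-and-multiply per element."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(query, vocabulary):
    """Brute force: measure the distance from the query to every word."""
    query_vec = encode_word(query)
    return min(vocabulary,
               key=lambda w: squared_distance(query_vec, encode_word(w)))

toy_vocabulary = ["dog", "dot", "cat", "door"]
print(nearest_word("dig", toy_vocabulary))  # dog

# Scaling the same loop up to the full vocabulary:
steps_per_lookup = 200_000 * COLUMNS_PER_CHAR * MAX_WORD_LENGTH
print(f"{steps_per_lookup:,}")  # 1,000,000,000
```

Note that "nearest" here only means "shares the most characters in the same positions", which is exactly why a character-level encoding says nothing about meaning.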
At this point all we’ve done is identify words at a character level. That has value, but what is really valuable isn’t a word, but its meaning and context. And for that, we need to know some number of words that surround the word being analyzed. Let’s stop there. There are other contexts of interest, but this one is going to be enough to break our model. Remember we said that there were approximately 200,000 words. Let’s say we want to, in our array, account for every possible word that might come before and after this word to provide context. To do this, you’d have to create a column for every possible word that might come before the one being evaluated (so, 200,000 more columns) and then again for the one after (so we’re up to 400,000 additional columns now, or 405,000 columns in total).
So now for every word in our document, we need to account for not only what word it is, but also the word that comes before and the word that comes after it in order to ascribe some contextual meaning. We’ve just gone from a billion multiplications that would take about a second on your computer to around 81 billion (200,000 words times 405,000 columns), meaning every word we evaluate now takes roughly a minute and twenty seconds. All that to capture a word and two surrounding words for context.
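The arithmetic behind those numbers is short enough to check directly. The sketch below assumes, as the estimate above does, that the machine gets through about a billion of these steps per second; that throughput figure is a rough assumption, not a benchmark.

```python
VOCABULARY_SIZE = 200_000        # approximate number of English words
CHARACTER_COLUMNS = 50 * 100     # 50 character positions x 100 columns each
CONTEXT_COLUMNS = 2 * VOCABULARY_SIZE  # one-hot word before + one-hot word after

total_columns = CHARACTER_COLUMNS + CONTEXT_COLUMNS
steps_per_word = VOCABULARY_SIZE * total_columns

# Assumed throughput: roughly one billion steps per second.
STEPS_PER_SECOND = 1_000_000_000
seconds_per_word = steps_per_word / STEPS_PER_SECOND

print(total_columns)     # 405000
print(steps_per_word)    # 81000000000
print(seconds_per_word)  # 81.0
```

Adding just two context words multiplied the work per word by 81, and each additional context word would add another 200,000 columns.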
We’re just getting started down the complexity train, and I’ve grossly oversimplified the issue. Let’s just say you get to a point where without some creativity, you’re going to be waiting a few years to figure out if the document you just scanned was an invoice or a high school research paper on political trends.
There are a number of approaches being taken in data science research to try to make this a workable problem, and great strides have been and are being made in cutting-edge technologies that promise to make it even more workable in the future. It’s a big problem, but it presents a huge opportunity for those who figure out how to solve pieces of it.