Core Text Processing Algorithms
How does a computer process a piece of text so that it can do interesting things with it, such as picking the correct meaning out of a sarcastic remark or telling spam from legitimate emails? This post explains four essential text processing algorithms: one-hot encoding, n-grams, word embeddings and bag of words.
Linguistics provides the means necessary for language understanding: it deals with semantics, syntax, grammar rules and so on. These concepts are modelled and applied in automated systems capable of processing unstructured text and extracting relevant information based on context.
Text processing was originally based on hand-crafted rules, which worked well for extracting information from unstructured text. With the rise of computational power in the 1980s, statistical and machine learning models gradually took over. Machine learning, and deep learning in particular, makes it possible to reason about unseen text. The models are generic and scale well with large volumes of data.
The following sections cover the basic text processing algorithms. Is there something about this topic you're wondering about that's missing here? Let me know in the comment section below.
What is One-Hot Encoding?
One-hot encoding is part of data preprocessing. When labels such as “Apple” or “Pear” are fed into an ML algorithm, they need to be encoded, i.e. transformed into a format a computer can work with. Some algorithms work directly with labels, but many others require numbers instead.
Let’s take a set of fruit categories: apples, pears, oranges and bananas. Don’t worry if the table below doesn’t make sense just yet.
| Label  | Ordinal | Binary Encoding | One-Hot Encoding |
|--------|---------|-----------------|------------------|
| Apple  | 1       | 001             | 1000             |
| Pear   | 2       | 010             | 0100             |
| Orange | 3       | 011             | 0010             |
| Banana | 4       | 100             | 0001             |
Intuitively, you’d assign each label an ordinal number: “Apple” becomes 1, “Pear” becomes 2 and so on. While you might not think twice about it, for an ML algorithm this ordinal encoding carries an unintended meaning: the higher the number, the more important the label appears to be. In the same sense that two is greater than one, “Pear” would be seen as superior to “Apple”. This would obviously lead to wrong results when processing categorised data.
In order to make all labels equal we need to reach for a binary representation. There are two options, binary encoding and one-hot encoding. Both solve the problem, but one-hot encoding has the advantage of simplicity.
When using the binary encoder you need to decode the binary number to arrive at the right label (011 => decimal 3 => Orange, 100 => decimal 4 => Banana). With one-hot encoding, the category is denoted by the position of the single “1” digit (0010 => 3rd position => Orange, 0001 => 4th position => Banana).
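As a quick illustration, here is a minimal sketch in plain Python that reproduces the encodings from the table above; the fruit labels are the same toy categories, and the three-bit binary width is just chosen to match the table.

```python
# Reproduce the ordinal, binary and one-hot encodings from the table above.
labels = ["Apple", "Pear", "Orange", "Banana"]

def one_hot(label, labels):
    """Return a one-hot vector: a single 1 at the label's position."""
    return [1 if l == label else 0 for l in labels]

for i, label in enumerate(labels, start=1):
    ordinal = i                   # 1, 2, 3, 4
    binary = format(i, "03b")     # 001, 010, 011, 100
    print(label, ordinal, binary, one_hot(label, labels))
# Apple 1 001 [1, 0, 0, 0]
# Pear 2 010 [0, 1, 0, 0]
# ...
```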
Word Suggestions with N-Grams
N-grams are useful for predicting which word is most likely to come next, given a history of preceding words. It’s essentially an optimisation technique: rather than estimating the probability of a word in the context of the entire corpus, you only consider a few preceding words. A single word is called a unigram, a sequence of two words a bigram, three words a trigram, four a 4-gram and so on. A bigram model, for example, predicts the next word from just the one word before it.
Let’s assume bigrams, i.e. sequences of two consecutive words. Given a toy corpus “This is a sentence and this is the word. That is a dog.”, it’s not hard to figure out that “is” always occurs after “this”. In other words, the probability of “is” given that the previous word is “this” is 1. Similarly, it’s easy to see that “is” is followed by either a definite or an indefinite article. Now imagine a hypothetical autocompletion feature: it would automatically suggest “is” after you’ve typed “this”, then it would suggest “a”, keeping “the” as the second best choice, and so on.
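To make this concrete, here is a rough sketch of such a bigram-based suggestion in Python, using the toy corpus above. The naive tokenisation (lower-casing and stripping full stops) is a simplifying assumption to keep the example short.

```python
from collections import Counter, defaultdict

corpus = "This is a sentence and this is the word. That is a dog."
tokens = corpus.lower().replace(".", "").split()

# Count how often each word follows a given previous word.
followers = defaultdict(Counter)
for prev, curr in zip(tokens, tokens[1:]):
    followers[prev][curr] += 1

def suggest(prev_word, k=2):
    """Return the k most likely next words with their probabilities."""
    counts = followers[prev_word.lower()]
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(k)]

print(suggest("this"))  # [('is', 1.0)]
print(suggest("is"))    # [('a', 0.666...), ('the', 0.333...)]
```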
Word Embeddings
The ultimate goal is to understand words in context. As humans we do it intuitively, based on our experience. Naturally, we understand that a king is a man and a queen is a woman. In the same sense we simply know that milk, juice or tea are best served as a breakfast drink. But, unless we talk about the Milky Way, we wouldn’t associate milk with a galaxy. Simply put, it’s all about the ability to identify similar words in a given context.
A computer achieves a similar capability by using so-called word embeddings. Rather than understanding words and amassing a huge vocabulary, the algorithm looks at certain features, scores them and groups the words based on how closely their features match. Check out this notebook by Allison Parrish; it goes into great detail and is very hands-on. Pay particular attention to how the script evolves from a simple example of grouping by just a couple of features to full vector arithmetic. Finding the most suitable word when adding and subtracting features mirrors the human ability to handle a context switch.
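To give a flavour of the idea, here is a toy sketch in the same spirit: the feature names and scores below are entirely made up for illustration, and similarity is measured with cosine similarity between vectors, as embedding libraries typically do. Real embeddings learn hundreds of such dimensions from data rather than a hand-picked handful.

```python
import numpy as np

# Hand-crafted toy "embeddings": each word scored on a few made-up features.
embeddings = {
    #                  royalty, maleness, drinkable
    "king":  np.array([0.95, 0.90, 0.00]),
    "queen": np.array([0.95, 0.05, 0.00]),
    "man":   np.array([0.10, 0.90, 0.00]),
    "woman": np.array([0.10, 0.05, 0.00]),
    "milk":  np.array([0.00, 0.00, 0.95]),
    "tea":   np.array([0.00, 0.00, 0.90]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(vector, exclude=()):
    """Find the vocabulary word whose vector is closest to the given vector."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], vector))

# "king - man + woman" lands closest to "queen": the classic context switch.
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(most_similar(result, exclude={"king", "man", "woman"}))  # queen
```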
Word2vec, GloVe, ELMo and others are implementations of word embeddings. They go beyond the limitations of n-grams and leverage neural networks to build feature-rich, accurate embeddings.
Bag of Words
Bag of words is a simple technique for quickly finding similar documents, and its simplicity is its main advantage. Suppose a document tens of thousands of words long. First, we filter out stop words, convert the contents to lower case and reduce words to their root form. This provides a foundation for a word-for-word comparison. Next, we count the words in the document and keep the N most frequent ones. This is, by the way, a typical “hello world” example from the big data space. Check this article to see how easy it is to do with tools like Apache Spark.
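Here is a bare-bones sketch of that pipeline in Python; the tiny stop-word list and the regex-based tokenisation are simplifying assumptions, and stemming is left out for brevity.

```python
import re
from collections import Counter

# A tiny hand-picked stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "of", "to", "in", "at", "over"}

def bag_of_words(text, top_n=10):
    """Lower-case, tokenise, drop stop words and keep the N most frequent words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens).most_common(top_n)

doc = "The quick brown fox jumps over the lazy dog. The dog barks at the fox."
print(bag_of_words(doc))
# [('fox', 2), ('dog', 2), ('quick', 1), ('brown', 1), ...]
```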
This gives us the bag of words along with their counts for the document. The bottom line is that there are far more documents than distinct words, especially after we’ve narrowed the word selection. Each document in scope is processed the same way and is therefore reduced to a bag of words (and their counts). This is a huge simplification, but it allows us to quickly and easily compare the documents and guess which ones are similar in terms of their content.
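And here is a sketch of the comparison step: each document is reduced to word counts and the bags are compared with cosine similarity. The example documents are made up, and stop-word removal and stemming are skipped to keep it short.

```python
import math
import re
from collections import Counter

def bag(text):
    """Reduce a document to a bag of lower-cased word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def similarity(bag_a, bag_b):
    """Cosine similarity between two bags of words (sparse count vectors)."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[w] * bag_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b)

doc1 = "Spark makes counting words in huge text files easy."
doc2 = "Counting words in text files is the hello world of Spark."
doc3 = "Bananas and oranges are rich in vitamins."

print(similarity(bag(doc1), bag(doc2)))  # relatively high: similar topics
print(similarity(bag(doc1), bag(doc3)))  # much lower: unrelated content
```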