When you text a friend saying ‘I’ll fall you later’, how does your iPhone know to correct ‘fall’ to ‘call’? Auto-correct owes its prowess to a field of growing importance among computer scientists, and an especially lively area of study at our very own Center for Data Science: Natural Language Processing (NLP).
Generally speaking, part of NLP research involves calculating ‘the joint probability distribution of words’ in a language. In other words: researchers working in English, for example, use algorithms to analyze large caches of English documents and texts, and calculate which words most frequently appear beside each other in various contexts, or which words share semantic similarity (synonyms). After identifying dominant word patterns in the English language, researchers can then write programs that predict which word is likely to come next in a sentence or a paragraph (the ‘probability distribution’).
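To make that idea concrete, here is a minimal, hypothetical sketch in Python of the simplest version of such a calculation: estimating which word is likely to follow another from raw bigram counts in a toy corpus. Real systems use vastly larger corpora and richer models; the corpus string and function names below are purely illustrative.

```python
from collections import Counter, defaultdict

# A toy corpus standing in for a large collection of English text.
corpus = "i will call you later i will call him tomorrow i will see you later"
tokens = corpus.split()

# Count how often each word follows every other word (bigram counts).
next_word_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    next_word_counts[current][following] += 1

def next_word_distribution(word):
    """Return an estimate of P(next word | word) from the counts."""
    counts = next_word_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("will"))   # e.g. {'call': 0.67, 'see': 0.33}
print(next_word_distribution("call"))   # e.g. {'you': 0.5, 'him': 0.5}
```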
NLP research not only makes features like auto-correct possible, but also has extraordinary implications for academic research. For example, sophisticated NLP programs could eventually help literary scholars or historians fill in the missing words in aged, damaged, or illegible manuscripts.
Today, there are a number of approaches to capturing and understanding language patterns in NLP. A popular method is word2vec, which transforms words into vectors. Words that often appear close to each other or share semantic similarity end up sharing a similar space on a graph like the one below, which depicts word vectors related to ‘good’ and ‘bad’ words.
The steeper the incline between each word pair (e.g., ‘good’ and ‘evil’), the more the two words differ in semantic meaning. And the word clusters represent associations: ‘rich’, ‘important’, ‘healthy’, and ‘good’ share a similar space on the graph because these words are most often used together, suggesting that wealth is (unsurprisingly) tied to social importance and physical well-being.
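As a rough illustration rather than a depiction of any particular research setup, the open-source gensim library exposes word2vec directly. The toy sentences below are far too small to learn meaningful geometry, but they show the shape of the API: each word becomes a vector, and nearby vectors correspond to words used in similar contexts.

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; in practice word2vec is trained on millions of sentences.
sentences = [
    ["the", "rich", "man", "is", "important", "and", "healthy"],
    ["a", "good", "life", "is", "a", "healthy", "life"],
    ["the", "poor", "man", "felt", "unimportant", "and", "sick"],
    ["evil", "deeds", "lead", "to", "a", "bad", "end"],
]

# vector_size: dimensionality of each word vector; window: context size in words.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

# Each word is now a point in a 50-dimensional space.
print(model.wv["good"][:5])                   # first few coordinates of the vector for "good"
print(model.wv.most_similar("good", topn=3))  # nearest neighbours by cosine similarity
```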
But the word2vec approach has a drawback: it must churn through a massive text corpus to produce its calculations. A more recent approach, ‘Eigenwords’, is faster and more efficient for NLP work because it uses spectral decomposition to calculate the joint probability of words within a scalable ‘context window’. For example, if a researcher sets the context window for a particular word like ‘cat’ to 3, the Eigenwords algorithm considers the three words that appear before and after each occurrence of ‘cat’ in a corpus, capturing the underlying word patterns within a specific context in less time and with less computational work.
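Eigenwords itself rests on a more elaborate spectral method than the sketch below, but the basic ingredients (a context window, a word co-occurrence matrix, and an eigendecomposition) can be illustrated under those simplifying assumptions. The `cooccurrence_matrix` helper and the toy sentence are purely hypothetical.

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=3):
    """Count how often each vocabulary word appears within `window`
    positions of every other vocabulary word."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[word], index[tokens[j]]] += 1
    return counts

tokens = "the cat sat on the mat while the dog chased the cat".split()
vocab = sorted(set(tokens))
C = cooccurrence_matrix(tokens, vocab, window=3)

# Spectral step: a truncated SVD of the co-occurrence matrix gives each word
# a low-dimensional embedding, without iterative training over the corpus.
U, S, Vt = np.linalg.svd(C)
embeddings = U[:, :2] * S[:2]          # keep the top 2 spectral dimensions
for word, vec in zip(vocab, embeddings):
    print(f"{word:>6}: {vec.round(2)}")
```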
Building on Eigenwords, CDS Master’s student Raul Delgado Sanchez has been researching how diffusion maps can take NLP research a step further as part of his Mathematics of Data Science course. Like the Eigenwords approach, a diffusion map analyzes data within a specific context window, and the distance between data points describes the relationship between those points. Diffusion maps can be applied to NLP research, Sanchez explained, where words become the data points, and the distances between them “represent the chances of ‘walking’ from one word to another”—in other words, the probability that those words appear close to (or far from) each other.
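A minimal numerical sketch of that idea, using a made-up word-affinity matrix rather than Sanchez’s actual data or code, might look like this: affinities between words are turned into random-walk probabilities, and the leading eigenvectors of that walk give each word a ‘diffusion coordinate’.

```python
import numpy as np

# Hypothetical word-affinity matrix: entry (i, j) counts how often word i and
# word j appear within the same context window (symmetric by construction).
words = ["cat", "dog", "pet", "stock", "market"]
A = np.array([
    [0, 4, 5, 0, 0],
    [4, 0, 5, 0, 0],
    [5, 5, 0, 1, 0],
    [0, 0, 1, 0, 6],
    [0, 0, 0, 6, 0],
], dtype=float)

# Row-normalise to get a Markov transition matrix: P[i, j] is the probability
# of "walking" from word i to word j in one step.
P = A / A.sum(axis=1, keepdims=True)

# Diffusion coordinates come from the leading non-trivial eigenvectors of P,
# scaled by their eigenvalues; words with nearby coordinates are words a
# random walk moves between easily.
eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)
coords = eigvecs.real[:, order[1:3]] * eigvals.real[order[1:3]]

for word, c in zip(words, coords):
    print(f"{word:>7}: {c.round(3)}")
```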
After building a matrix similar to the one Eigenwords uses, Sanchez conducted a series of short experiments demonstrating that diffusion maps can calculate the joint probability distribution of words to a standard comparable with other models, suggesting that diffusion maps may be a promising direction for further NLP research.
by Cherrie Kwok