Why is “unbreakable” such a special word for linguists? Unbeknownst to the general public, the word is actually composed of three smaller units that linguists call morphemes. A morpheme is the smallest meaningful grammatical unit of a language; the three morphemes in ‘unbreakable’ are ‘un-’, ‘break’, and ‘-able’.
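The idea of peeling a word apart into morphemes can be shown with a toy sketch. This is only an illustration of the concept, not Eisner’s algorithm: it assumes a tiny hand-written inventory of prefixes and suffixes and greedily strips them off.

```python
# Toy illustration of morpheme segmentation (not Eisner's method):
# greedily strip one known prefix and one known suffix from a word.

PREFIXES = {"un", "re", "dis"}   # illustrative inventory, not exhaustive
SUFFIXES = {"able", "ness", "ing"}

def segment(word):
    """Split a word into its prefix, root, and suffix morphemes where possible."""
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p):
            morphemes.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s):
            suffix = s
            word = word[:-len(s)]
            break
    morphemes.append(word)       # whatever remains is treated as the root
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(segment("unbreakable"))    # ['un', 'break', 'able']
```

Real morphological analysis is far harder than this, of course: the same surface letters do not always mark the same morpheme, which is exactly why statistical and computational approaches are needed.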
Studying the underlying morpheme structures within words has been one focus of research by Professor Jason Eisner of Johns Hopkins University. Speaking to CDS’ Text-As-Data seminar, Eisner explained that he and his collaborators have so far used computational techniques to automatically analyze the pronunciations of a set of words and identify the underlying morphemes they share. Because some morphemes are pronounced differently depending on their context, their algorithm also captures the specific sounds, or phonemes, that make up words.
This research has major implications for those working in linguistics, cognitive research, and engineering. For linguists, we can now ask how the same sounds are used to construct different, or similar, words across languages, enabling us to discover whether our languages are diverse or closely linked. Moreover, tracking which morphemes typically occur together can help us understand how words are formed.
For those interested in cognitive research, studying word formation can help us better understand how toddlers learn phonological processes. And for those interested in engineering, we can now begin to combine languages.
Creating synthetic languages is a large part of what Eisner has been working on. Eisner and his colleague Dingquan Wang have recently released the Galactic Dependencies Treebanks (GDT), a set of over 50,000 synthetic languages. To put that into perspective, that’s roughly seven times the number of languages spoken on Earth.
Part of creating the GDT involved mixing and matching grammar rules that have already been identified in existing human languages. The researchers take an English sentence, for example, and rearrange its words into another language’s word order, like that of Hindi or Japanese.
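The core idea of rearranging a sentence into another language’s word order can be sketched in miniature. The tree format and the single “verb-final” rule below are illustrative assumptions, not the published Galactic Dependencies procedure: the sketch keeps a sentence’s dependency structure fixed and simply linearizes it under a different ordering rule.

```python
# Hypothetical sketch of rearranging word order over a dependency tree:
# the same tree is linearized in English-like order (head between its
# dependents) or in verb-final order, as in Hindi or Japanese.

def linearize(node, verb_final=False):
    """Flatten a dependency tree into a list of words.

    Each node is a dict with a "word", an optional "pos", and optional
    "left"/"right" dependent lists. With verb_final=True, a verb is
    placed after all of its dependents.
    """
    left = [w for d in node.get("left", []) for w in linearize(d, verb_final)]
    right = [w for d in node.get("right", []) for w in linearize(d, verb_final)]
    if verb_final and node.get("pos") == "VERB":
        return left + right + [node["word"]]
    return left + [node["word"]] + right

# "the cat chased the mouse", headed by the verb "chased"
tree = {
    "word": "chased", "pos": "VERB",
    "left": [{"word": "cat", "pos": "NOUN",
              "left": [{"word": "the", "pos": "DET"}]}],
    "right": [{"word": "mouse", "pos": "NOUN",
               "left": [{"word": "the", "pos": "DET"}]}],
}

print(" ".join(linearize(tree)))                   # the cat chased the mouse
print(" ".join(linearize(tree, verb_final=True)))  # the cat the mouse chased
```

The second output mimics the subject-object-verb order of Japanese or Hindi while keeping the English vocabulary, which is the flavor of sentence the synthetic treebanks contain.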
Of course, these synthetic languages are not intended to usurp or replace existing languages. The point of the GDT, Eisner explained, is to provide new data for NLP methods that aim to adapt to unfamiliar languages. Training on unfamiliar languages can help these models learn how parsing works in languages more generally. Capturing this transferable knowledge means that future NLP methods may not have to be taught each language one by one. They will know what clues to look for when analyzing a new language, helping them uncover its structure. Taking his exciting work on the GDT forward, Eisner will be focusing on even more unsupervised and concentrated discovery of syntactic structure.
by Nayla Al-Mamlouk