April 14, 2017

Creating A Multi-Genre Corpus for Natural Language Inference

Natural language processing (NLP) has made major strides in the last few years, but to what extent can an NLP algorithm understand human sentences beyond a superficial read? Algorithms can computationally identify, count, or regurgitate individual words, phrases, and sentences, but can they capture the meaning behind the words that they are handling?

These questions are at the heart of a fledgling sub-field within NLP called Natural Language Inference (NLI), the area in which CDS professor Sam Bowman currently works.

A typical NLI test runs something like this: the algorithm is given a sentence like “A man is playing with a dog,” and then a hypothesis about that sentence like “Two dogs are playing together.” The algorithm is then asked to compare the sentence with the hypothesis and infer whether the hypothesis is a contradiction (i.e. false given the sentence), neutral (neither clearly true nor clearly false), or an entailment (i.e. true given the sentence). In this case, the hypothesis is a contradiction, and whether or not the algorithm infers this reveals the extent to which it has understood the sentence.
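To make the setup concrete, here is a minimal Python sketch of how such a test could be represented and scored. It is an illustration only, not the format or tooling used by the corpus itself: the names `NLIExample`, `accuracy`, and `always_neutral` are hypothetical, and the second and third sentence pairs are invented for this example. Only the first pair comes from the description above.

```python
from dataclasses import dataclass

# The three possible relations between a sentence and a hypothesis.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class NLIExample:
    """One test item: a sentence (premise), a hypothesis, and a human label."""
    premise: str
    hypothesis: str
    label: str

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def accuracy(examples, predict):
    """Fraction of examples where predict(premise, hypothesis) matches the label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        predict(ex.premise, ex.hypothesis) == ex.label for ex in examples
    )
    return correct / len(examples)


def always_neutral(premise, hypothesis):
    """A trivial stand-in for a real model: it ignores its input entirely."""
    return "neutral"


examples = [
    # Pair taken from the article.
    NLIExample("A man is playing with a dog.",
               "Two dogs are playing together.", "contradiction"),
    # Invented pairs, for illustration only.
    NLIExample("A man is playing with a dog.",
               "A person is interacting with an animal.", "entailment"),
    NLIExample("A man is playing with a dog.",
               "The man owns the dog.", "neutral"),
]

score = accuracy(examples, always_neutral)
print(f"accuracy of the always-neutral baseline: {score:.2f}")
```

A real system would replace `always_neutral` with a trained model. The point of the sketch is that a labelled collection of pairs is all that is needed to measure how often the algorithm's inference agrees with human judgment.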

For researchers to perform these tests, however, they need a corpus of sentence and hypothesis pairs. Such is the purpose of Bowman’s multi-genre NLI corpus, an exciting project that received a Google Faculty Research Award earlier this year.

Working with Adina Williams and Nikita Nangia from NYU, and Angeliki Lazaridou from Google DeepMind, Bowman and his team are creating a multi-genre NLI corpus that builds on his earlier work, the SNLI corpus. The SNLI corpus is a collection of 570,000 human-written sentence pairs that were manually labelled. Since its creation, the SNLI corpus has become a vital benchmark for researchers in the field.

But a drawback to the SNLI corpus is that all of its sentences were extracted from a single genre: image captions. The multi-genre NLI corpus that Bowman and his researchers are now working on addresses this problem by drawing sentences from several different areas. In addition to the SNLI image captions, the multi-genre corpus collects sentences from fiction, government documents, news articles, telephone transcriptions, travel guides, the 9/11 report, face-to-face speech, letters, nonfiction books, and magazines.

By collecting both written and spoken text, this massive multi-genre corpus will help researchers test and identify problems with their current algorithms. It also takes us one step closer to learning how to foster transferable language skills in machine brains. At present, the project is using crowd-sourcing to manually label the sentence extracts. The multi-genre NLI corpus will also be the basis for the RepEval 2017 shared task later this year.

by Cherrie Kwok