A Practical Algorithm for Topic Modeling with Provable Guarantees

Originally developed as a text-mining tool, topic modeling has seen uses in fields such as genetics and bioinformatics. It is a popular approach to finding ideas within large text bodies that could not feasibly be manually analyzed by humans. With the digitization of much of humanity’s written knowledge, topic modeling for insight has become ever more viable.

The Center for Data Science’s Yoni Halpern and David Sontag collaborated with Princeton researcher Sanjeev Arora on a 2013 paper, “A Practical Algorithm for Topic Modeling with Provable Guarantees,” that produced a new topic modeling algorithm producing results comparable to the best contemporary approaches, while running vastly faster.

In topic modeling, a document is seen as a mixture of topics; those topics are assumed to be made of of a distribution of words, which are all present in a vocabulary. This means every word in the document is considered a word token, traceable to a topic within the document.

Arora’s, Halpern’s and Sontag’s paper analyzed previous algorithms, and said that a previous algorithm provided by another Arora project “provably recovered the parameters of topic models, provided that every topic contains at least one anchor word with non-zero probability only in that topic.” However, the paper said previous algorithms’ runtimes were slow.

Arora’s original algorithm from his previous project had two steps. The first step selected anchor words for each topic; if a document contained a topic’s anchor word, it was guaranteed that topic was present in the document. The second step, the recovery step, generated topics around the anchor words. The first step required the solving of linear programs, which slowed down the algorithm. The authors of the paper said that they cut down on the runtime by using a combinatorial anchor selection algorithm that did away with the solving of linear programs.

The paper also targeted the second recovery step, which originally reconstructed the topic distributions using matrix inversions. This step considerably slowed the algorithm’s performance. The researchers developed “a simple probabilistic interpretation of the recovery step that replaces matrix inversion with gradient-based inference,” further speeding up runtime.

The researchers then trained models on two synthetic data sets, where they ensured model assumptions were correct, as well as on real documents, to judge real-world performance. They used a corpus of 295,000 New York Times documents, as well as one comprising 1100 abstracts from the conference “Neural Information Processing Systems”.

The resulting algorithm performed as well as algorithms currently in use, such as Markov chain Monte Carlo algorithms, and according to the researchers, “runs at least an order of magnitude faster, and as much as fifty times faster on large datasets.” The researchers also said that the processing speed of the new algorithm would allow real-time analysis of large data streams.

In conclusion, the researchers noted that their algorithms could be paralleled, meaning that the execution of different parts of the process could be carried out simultaneously. This means that the algorithm could potentially support web-scale topic inference, topic modeling on a global-class level of computing.

NYU Center for Data Science