Big Data, Big Questions: How Does Tumblr’s Graph-Based Topic Modeling Work?

What sets Tumblr apart from other social media platforms is the unique way its users communicate with each other. Each user has their own highly customizable blog where they can post and share content (such as articles, images, GIFs, or videos) or re-post content published by another user. Sharing and re-posting content is key not only to how social connections are formed, but also to how trending and popular topics are established, since users must tag each post that they publish.

But with over 335 million microblogs, how can Tumblr keep track of which topics are most popular? While handling Tumblr’s massive data set is already a challenge, another part of the problem is interpretability. For example, one user may tag an image of Pikachu as ‘Pokemon’, while another may tag it as ‘Pokemon Go!’

Crafting computational solutions that merge related tags into a single topic is a major part of Nicola Barbieri’s work as a Senior Data Scientist at Tumblr. At last Wednesday’s Research Lunch Seminar, Barbieri explained how his team is currently using a graph-based topic modeling process to identify topics on Tumblr’s platform.

Although a popular technique for performing topic modeling today is Latent Dirichlet Allocation (LDA), its methodology hits some stumbling blocks when applied to Tumblr. One problem is that LDA typically relies on the data scientist establishing in advance how many topics there are in a data set.

Yet, as Barbieri stated, Tumblr cannot know or predict how many topics there may be: the platform is not only too large but also prone to particularly dynamic fluctuations, as its audience includes users as young as thirteen.

But a graph-based, semi-supervised machine learning approach to topic modeling, Barbieri demonstrated, addresses these problems. Assuming that the tags with the most followers and users are the most popular, this approach assigns each tag a score derived from its number of followers and subscribers, according to the following formula:

score(tag, topic) = f(w(tag, topic), subscribers(tag))
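To make the formula concrete, here is a minimal Python sketch. The talk summary does not say how f or w are defined, so everything specific below is an assumption for illustration: w is treated as a hypothetical edge weight linking a tag to a topic, f is taken to be that weight multiplied by a log-damped subscriber count, and all tags, topics, and numbers are invented. It is not Tumblr's implementation.

```python
import math

# Hypothetical edge weights w(tag, topic): how strongly a tag is tied to a topic.
# All values are invented for illustration only.
WEIGHTS = {
    ("pokemon", "gaming"): 0.9,
    ("pokemon go!", "gaming"): 0.8,
    ("pikachu", "gaming"): 0.7,
    ("pokemon", "animation"): 0.4,
    ("pikachu", "animation"): 0.5,
}

# Hypothetical subscriber counts per tag.
SUBSCRIBERS = {"pokemon": 120_000, "pokemon go!": 45_000, "pikachu": 8_000}


def score(tag, topic, weights=WEIGHTS, subscribers=SUBSCRIBERS):
    """Assumed form of f: edge weight times a log-damped subscriber count."""
    w = weights.get((tag, topic), 0.0)
    return w * math.log1p(subscribers.get(tag, 0))


def top_tags(topic, k=2, weights=WEIGHTS):
    """Return the k highest-scoring tags that have an edge to the given topic."""
    tags = {tag for tag, t in weights if t == topic}
    return sorted(tags, key=lambda tag: score(tag, topic), reverse=True)[:k]


if __name__ == "__main__":
    for topic in ("gaming", "animation"):
        print(topic, top_tags(topic))
```

The logarithm is one possible design choice: it keeps a tag with a very large subscriber base from dominating every topic it touches, while still rewarding popularity.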

This fascinating approach is scalable and straightforward, and elegantly solves the problem of not knowing how many topics there are in a given data set. It also allows Tumblr to build a more detailed picture of what content its users are sharing, since both macro-topics and the micro-topics within them can be identified.

by Cherrie Kwok

NYU Center for Data Science