Kyunghyun Cho is an Assistant Professor at NYU’s Center for Data Science who conducts research in natural language processing. A recent paper he co-authored, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” proposes an attention-based model for image caption generation.
Can you give us a bit of background on why you chose to look into the subject of image description?
One big question in the field of machine learning and artificial intelligence research is whether there exists a single, generic learning mechanism that can work with any type of data and task. Can we build an artificial neural network that works both on text and images? Can the deep convolutional neural network—which is widely used in object recognition—also work well with natural language text? These questions motivate much of my research.
Before I came to NYU, I was conducting research at the University of Montreal, where my colleagues and I had developed a neural machine translation system that could translate a sentence from one language into another. So I wondered: are the properties of neural machine translation generic enough to work on other types of problems? We noticed that, to an artificial neural network, an image is not too different from a natural language sentence, once it’s transformed into a real-valued continuous representation: they are both a bunch of numbers! This naive but important realization motivated me to work on translating an image into its description.
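To make the “bunch of numbers” point concrete, here is a tiny NumPy sketch. The embedding table, word ids, and feature sizes are all made-up stand-ins for illustration; the point is only that an encoded sentence and an encoded image are both just real-valued arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

# A sentence: each word id is looked up in a toy, randomly initialized
# embedding table, giving a sequence of real-valued vectors.
vocab_embeddings = rng.standard_normal((10_000, 128))  # 10k-word vocabulary
word_ids = [42, 7, 1337, 56, 3]                        # e.g. "a dog chases a cat"
sentence = vocab_embeddings[word_ids]                  # shape (5, 128)

# An image: a convolutional network maps pixels to a grid of feature
# vectors; random stand-ins here, 14 x 14 locations, 512 dims each.
image_features = rng.standard_normal((14 * 14, 512))   # shape (196, 512)

# To a neural network, both are just arrays of real numbers.
print(sentence.shape, sentence.dtype)              # (5, 128) float64
print(image_features.shape, image_features.dtype)  # (196, 512) float64
```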
Was the task to be able to identify a single object, or was the task to figure out how objects stand in relation to each other?
The goal was to generate a natural language description that would describe how parts of an image stand in relation to each other, as opposed to tagging which objects are in the image. Imagine an image where a dog is chasing a cat. Object detection will return a set of objects—the dog and the cat—but image caption generation will tell you that the dog is chasing the cat.
What are some of the practical applications of computers being able to describe an image?
This sort of technology could open a new era for blind people. If you had this technology embedded into something like Google Glass, the image description system could tell a blind user what is going on in front of them.
At any point in your research did your objectives change? Were there any assumptions you had going in that had to be adjusted once you started to analyze the data?
When working on this kind of short-term research project, the first thing I do is set my expectations for the outcome. What do I expect to learn from this project? How will this let me make another step toward the ultimate goal of understanding intelligence? In this case, my expectation was that a neural network could automatically figure out the sequence in which different parts of a given image are described. With this expectation, my colleagues and I began our experiments.
As with any scientific discipline, we began by designing a model based on our observations of the phenomenon; in this case, we were looking at how humans describe images. After the first model was built, we evaluated it to see how well it met our expectations. That greater understanding led to an improved model, and that cycle continued.
What were the findings of your paper?
We found that an artificial neural network is able to extract complicated, underlying structures across multiple modalities (in our case, an image and natural language text). Now we’re wondering how far we can push this: what is the limit on the number of modalities one network can handle? What kinds of structures can a network find, other than a simple alignment between a word and an object?
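For readers curious what that word-to-object alignment looks like mechanically, here is a minimal NumPy sketch of soft (additive) attention, the kind of mechanism the paper builds on. All dimensions, weight matrices, and values are illustrative stand-ins, not the paper’s actual parameters:

```python
import numpy as np

def soft_attention(features, state, W_f, W_s, v):
    """Score each image location against the decoder state, normalize the
    scores with a softmax, and return a context vector: a weighted sum of
    locations. The full model wraps this step in an LSTM decoder."""
    scores = np.tanh(features @ W_f + state @ W_s) @ v  # (num_locations,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()        # attention weights over image locations
    context = weights @ features    # (feature_dim,) weighted image summary
    return context, weights

rng = np.random.default_rng(0)
features = rng.standard_normal((196, 512))  # 14 x 14 CNN feature grid
state = rng.standard_normal(256)            # current decoder hidden state
W_f = rng.standard_normal((512, 64))        # illustrative projections
W_s = rng.standard_normal((256, 64))
v = rng.standard_normal(64)

context, weights = soft_attention(features, state, W_f, W_s, v)
# `weights` indicates which image regions the model attends to while
# emitting the next word: a soft alignment between words and objects.
```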
I would imagine that the next step for this sort of technology would be to analyze videos. What sort of leaps in technology are needed to get to that point?
There are a lot of technological leaps needed before we can generate a full video description. The first problem is computational: even a high-resolution image is manageable, but this is not true for video, as a single video can consist of hundreds of thousands, if not millions, of frames.
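To put rough numbers on that computational issue, here is a back-of-envelope calculation; the frame rate and feature sizes are assumptions for illustration, not measurements from this work:

```python
# Back-of-envelope: why video is so much heavier than single images.
fps = 30                         # assumed frame rate
duration_s = 2 * 60 * 60         # a two-hour video
frames = fps * duration_s        # 216,000 frames

# Illustrative per-frame CNN feature grid: 14 x 14 locations, 512 dims.
floats_per_frame = 14 * 14 * 512
bytes_per_frame = floats_per_frame * 4      # float32

total_gb = frames * bytes_per_frame / 1e9
print(f"{frames} frames, about {total_gb:.0f} GB of features")
# -> 216000 frames, about 87 GB of features
```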
The other problem is that image description relies on supervised learning, where annotations are available for each image. To use the dog and cat example, at this point we have to tell the model where the dog is and where the cat is before the model can tell us that the dog is chasing the cat. We have plenty of annotated images, but annotated videos are much harder to come by.
To solve this problem of creating annotations, I think there is a lot of potential in the fields of semi-supervised and unsupervised learning. In unsupervised learning, we aim to build a machine learning model that can learn without strong supervision, and could eventually create those annotations itself.