Publications
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
TLDR
This work extensively evaluates Multimodal Compact Bilinear Pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
Grounding of Textual Phrases in Images by Reconstruction
TLDR
A novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly, and demonstrates the effectiveness on the Flickr 30k Entities and ReferItGame datasets.
Speaker-Follower Models for Vision-and-Language Navigation
TLDR
Experiments show that all three components of this approach (speaker-driven data augmentation, pragmatic reasoning, and panoramic action space) dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.
A Dataset for Movie Description
TLDR
Comparing ADs (audio descriptions) to scripts, it is found that ADs are far more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production.
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
TLDR
It is quantitatively shown that training with the textual explanations not only yields better textual justification models, but also better localizes the evidence that supports the decision, supporting the thesis that multimodal explanation models offer significant benefits over unimodal approaches.
Movie Description
TLDR
A novel dataset containing transcribed ADs temporally aligned to full-length movies is proposed; it is found that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production.
Coherent Multi-sentence Video Description with Variable Level of Detail
TLDR
This paper follows a two-step approach in which it first learns to predict a semantic representation (SR) from video and then generates natural language descriptions from it, and models across-sentence consistency at the level of the SR by enforcing a consistent topic.
Women also Snowboard: Overcoming Bias in Captioning Models
TLDR
A new Equalizer model is introduced that ensures equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present; it has lower error than prior work when describing images with people and mentioning their gender, and more closely matches the ground-truth ratio of sentences mentioning women to sentences mentioning men.
The Long-Short Story of Movie Description
TLDR
This work shows how to learn robust visual classifiers from the weak annotations of sentence descriptions and to generate a description using an LSTM, achieving the best performance to date on the challenging MPII-MD and M-VAD datasets.
Robust Change Captioning
TLDR
A novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning, which learns to distinguish distractors from semantic changes, localize the changes via Dual Attention over “before” and “after” images, and accurately describe them in natural language via Dynamic Speaker.