Publications
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
Hierarchical Question-Image Co-Attention for Visual Question Answering
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
This paper proposes a novel adaptive attention model with a visual sentinel that sets a new state of the art by a significant margin on image captioning.
Graph R-CNN for Scene Graph Generation
A novel scene graph generation model called Graph R-CNN is proposed that is both effective and efficient at detecting objects and their relations in images, and a new evaluation metric is introduced that is more holistic and realistic than existing metrics.
Neural Baby Talk
A novel framework for image captioning is introduced that produces natural language explicitly grounded in entities found by object detectors in the image, reaching state-of-the-art results on both the COCO and Flickr30k datasets.
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
A novel training framework for neural sequence models, particularly for grounded dialog generation, is presented that leverages the recently proposed Gumbel-Softmax approximation to the discrete distribution, introduces a stronger encoder for visual dialog, and employs a self-attention mechanism for answer encoding.
ParlAI: A Dialog Research Software Platform
ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, is introduced to provide a unified framework for sharing, training, and testing dialog models, with integration of Amazon Mechanical Turk for data collection, human evaluation, and online/reinforcement learning.
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
A self-monitoring agent is proposed with two complementary components: (1) a visual-textual co-grounding module that locates the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images, and (2) a progress monitor that ensures the grounded instruction correctly reflects the navigation progress.