VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer …
  • 1,832
  • 448
  • Open Access
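As a concrete illustration of the task in the entry above, here is a minimal fuse-and-classify sketch: encode the question with an LSTM, project pooled CNN image features, fuse the two, and score a fixed answer set. The module choices, sizes, and element-wise fusion are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Toy fuse-and-classify VQA model: question encoding + image features -> answer scores."""
    def __init__(self, vocab_size=10000, num_answers=1000,
                 img_dim=2048, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # question word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # question encoder
        self.img_proj = nn.Linear(img_dim, hidden_dim)                 # project pooled CNN features
        self.classifier = nn.Linear(hidden_dim, num_answers)           # score a fixed answer set

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim) pooled image features; question_ids: (B, T) token ids
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                   # (B, hidden_dim) question summary
        v = torch.tanh(self.img_proj(img_feat))     # (B, hidden_dim) image summary
        return self.classifier(q * v)               # element-wise fusion -> answer logits

model = SimpleVQA()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```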
Hierarchical Question-Image Co-Attention for Visual Question Answering
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" (visual attention), it is equally important to model "what words to listen to" (question attention) …
  • 732
  • 116
  • Open Access
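A sketch of the parallel co-attention idea the entry above points at: an affinity matrix between question words and image regions drives attention over both modalities at once. Layer names and dimensions are assumptions, and this omits the paper's word/phrase/question hierarchy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Simplified parallel co-attention over image regions and question words."""
    def __init__(self, dim=512):
        super().__init__()
        self.W_b = nn.Linear(dim, dim, bias=False)   # bilinear affinity weights
        self.W_v = nn.Linear(dim, dim)               # image transform
        self.W_q = nn.Linear(dim, dim)               # question transform
        self.w_hv = nn.Linear(dim, 1)                # image attention scores
        self.w_hq = nn.Linear(dim, 1)                # question attention scores

    def forward(self, V, Q):
        # V: (B, N, dim) image region features; Q: (B, T, dim) question word features
        C = torch.tanh(Q @ self.W_b(V).transpose(1, 2))                   # (B, T, N) affinity
        H_v = torch.tanh(self.W_v(V) + C.transpose(1, 2) @ self.W_q(Q))   # (B, N, dim)
        H_q = torch.tanh(self.W_q(Q) + C @ self.W_v(V))                   # (B, T, dim)
        a_v = F.softmax(self.w_hv(H_v), dim=1)       # (B, N, 1) attention over regions
        a_q = F.softmax(self.w_hq(H_q), dim=1)       # (B, T, 1) attention over words
        v_att = (a_v * V).sum(dim=1)                 # attended image summary
        q_att = (a_q * Q).sum(dim=1)                 # attended question summary
        return v_att, q_att

coattn = ParallelCoAttention()
v, q = coattn(torch.randn(2, 36, 512), torch.randn(2, 14, 512))
print(v.shape, q.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```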
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information to predict non-visual words such as "the" and "of" …
  • 622
  • 95
  • Open Access
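A sketch of the visual-sentinel gating described in the entry above: the attention distribution over image regions is extended with a sentinel slot, so the decoder can down-weight the image when generating non-visual words. The exact score functions and sizes here are assumptions, not the paper's verbatim equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Mix attended visual context with a sentinel vector via one softmax over N regions + sentinel."""
    def __init__(self, dim=512):
        super().__init__()
        self.att_v = nn.Linear(dim, dim)   # scores image regions
        self.att_h = nn.Linear(dim, dim)   # conditions scores on the decoder state
        self.att_s = nn.Linear(dim, dim)   # scores the sentinel
        self.w = nn.Linear(dim, 1)

    def forward(self, V, h, s):
        # V: (B, N, dim) region features; h: (B, dim) decoder state; s: (B, dim) sentinel
        z_v = self.w(torch.tanh(self.att_v(V) + self.att_h(h).unsqueeze(1)))  # (B, N, 1)
        z_s = self.w(torch.tanh(self.att_s(s) + self.att_h(h)))               # (B, 1)
        alpha = F.softmax(torch.cat([z_v.squeeze(-1), z_s], dim=1), dim=1)    # (B, N + 1)
        c = (alpha[:, :-1].unsqueeze(-1) * V).sum(dim=1)   # visual context
        beta = alpha[:, -1:]                               # weight on the sentinel
        return beta * s + (1 - beta) * c                   # adaptive context vector

attn = AdaptiveAttention()
ctx = attn(torch.randn(2, 49, 512), torch.randn(2, 512), torch.randn(2, 512))
print(ctx.shape)  # torch.Size([2, 512])
```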
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model …
  • 211
  • 47
  • Open Access
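A sketch of a co-attentional transformer layer in the spirit of ViLBERT's two-stream design: each stream's queries attend over the other stream's keys and values. Hidden size, head count, and the residual/normalization layout are assumptions.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Two-stream cross-attention: text attends to image regions and vice versa."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text, visual):
        # text: (B, T, dim) token states; visual: (B, N, dim) region states
        t_attn, _ = self.txt_to_img(text, visual, visual)   # text queries, image keys/values
        v_attn, _ = self.img_to_txt(visual, text, text)     # image queries, text keys/values
        return self.norm_t(text + t_attn), self.norm_v(visual + v_attn)

layer = CoAttentionLayer()
t, v = layer(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
print(t.shape, v.shape)  # torch.Size([2, 20, 768]) torch.Size([2, 36, 768])
```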
Graph R-CNN for Scene Graph Generation
We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with the quadratic number of potential relations between objects …
  • 196
  • 28
  • Open Access
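A sketch of the relation-proposal step named in the entry above: score the relatedness of every detected object pair using projected features and keep only the top-K pairs, so downstream relation classification avoids the quadratic blow-up. The projection sizes, sigmoid scoring, and value of K are assumptions.

```python
import torch
import torch.nn as nn

class RelationProposalNetwork(nn.Module):
    """Toy relatedness scorer: subject/object projections, pairwise scores, top-K pruning."""
    def __init__(self, feat_dim=1024, proj_dim=256):
        super().__init__()
        self.subj = nn.Linear(feat_dim, proj_dim)   # subject-role projection
        self.obj = nn.Linear(feat_dim, proj_dim)    # object-role projection

    def forward(self, obj_feats, top_k=64):
        # obj_feats: (N, feat_dim) features of N detected objects
        n = obj_feats.size(0)
        scores = torch.sigmoid(self.subj(obj_feats) @ self.obj(obj_feats).t())  # (N, N) relatedness
        scores = scores * (1.0 - torch.eye(n))                                  # drop self-relations
        k = min(top_k, n * n)
        top_scores, idx = scores.flatten().topk(k)
        pairs = torch.stack([idx // n, idx % n], dim=1)    # (k, 2) subject/object indices
        return pairs, top_scores

repn = RelationProposalNetwork()
pairs, rel_scores = repn(torch.randn(10, 1024))
print(pairs.shape)  # torch.Size([64, 2])
```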
Neural Baby Talk
We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches with modern neural captioning approaches …
  • 179
  • 28
  • Open Access
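A sketch of the slot-filling decision the entry above describes: at each step the decoder either emits a vocabulary word or points at a detected region whose category fills the slot. The gate and scoring forms here are assumptions, not the paper's exact heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotOrWordHead(nn.Module):
    """Per-step choice between generating a textual word and pointing at a detected region."""
    def __init__(self, hidden_dim=512, vocab_size=10000, region_dim=2048):
        super().__init__()
        self.word_head = nn.Linear(hidden_dim, vocab_size)     # textual-word scores
        self.region_proj = nn.Linear(region_dim, hidden_dim)   # project detector features
        self.switch = nn.Linear(hidden_dim, 1)                 # visual-vs-textual gate

    def forward(self, h, region_feats):
        # h: (B, hidden_dim) decoder state; region_feats: (B, N, region_dim) detected regions
        p_visual = torch.sigmoid(self.switch(h))                       # (B, 1) gate value
        word_probs = F.softmax(self.word_head(h), dim=-1)              # (B, V)
        region_logits = (self.region_proj(region_feats) @ h.unsqueeze(-1)).squeeze(-1)
        region_probs = F.softmax(region_logits, dim=-1)                # (B, N)
        # Mixture over the two outcomes: emit a word or point at a region (slot).
        return (1 - p_visual) * word_probs, p_visual * region_probs

head = SlotOrWordHead()
w, r = head(torch.randn(2, 512), torch.randn(2, 36, 2048))
print(w.shape, r.shape)  # torch.Size([2, 10000]) torch.Size([2, 36])
```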
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
We present a novel training framework for neural sequence models, particularly for grounded dialog generation. The standard training paradigm for these models is maximum likelihood estimation (MLE), or minimizing the cross-entropy of the human responses …
  • 83
  • 24
  • Open Access
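For reference, a minimal sketch of the MLE / cross-entropy objective the entry above calls the standard training paradigm (the paper's contribution is an alternative built on top of it). The toy decoder, shapes, and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, T, B, hidden = 1000, 12, 4, 64
decoder = nn.LSTM(hidden, hidden, batch_first=True)   # toy stand-in for a dialog decoder
out_proj = nn.Linear(hidden, vocab_size)              # maps decoder states to word logits
context = torch.randn(B, T, hidden)                   # per-step grounded dialog inputs
targets = torch.randint(0, vocab_size, (B, T))        # human response tokens

states, _ = decoder(context)
logits = out_proj(states)                             # (B, T, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                       # minimize NLL of the human response
print(float(loss))
```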
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer …
  • 140
  • 20
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instructions in photo-realistic unknown environments. This challenging task demands that the agent be aware of which parts of the instruction have been completed and which are needed next …
  • 55
  • 16
  • Open Access
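A sketch of an auxiliary progress-estimation head of the kind the entry above implies: regress how much of the instruction has been completed from the agent's state and train it alongside the policy. The input features and supervision target are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressEstimator(nn.Module):
    """Auxiliary head: estimate instruction-completion progress in [0, 1] from the agent state."""
    def __init__(self, state_dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, state):
        # state: (B, state_dim) fused instruction/visual agent state
        return torch.sigmoid(self.head(state)).squeeze(-1)   # estimated progress

est = ProgressEstimator()
progress = est(torch.randn(2, 512))
aux_loss = F.mse_loss(progress, torch.tensor([0.3, 0.8]))    # supervise with path-based progress
print(float(aux_loss))
```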
ParlAI: A Dialog Research Software Platform
We introduce ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, available at http://parl.ai. Its goal is to provide a unified framework for sharing, training, and testing dialog models …
  • 132
  • 12
  • Open Access