Publications
VQA: Visual Question Answering
TLDR
We propose the task of free-form and open-ended Visual Question Answering (VQA).
  • 2,198
  • 519
CIDEr: Consensus-based image description evaluation
TLDR
Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. (A simplified sketch of the consensus-based scoring appears after this entry.)
  • 1,583
  • 480
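CIDEr scores a candidate caption by its consensus with the human references: the candidate and each reference are represented as TF-IDF weighted n-gram vectors, their cosine similarity is averaged over the reference set, and the result is averaged over n-gram orders 1 through 4. The Python snippet below is a simplified sketch of that computation under those assumptions; it omits details of the released implementation (stemming, the CIDEr-D length and clipping refinements), and the function names are illustrative.

    from collections import Counter
    from math import log, sqrt

    def ngrams(tokens, n):
        # Counter of n-grams (as tuples) in a tokenized sentence.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def tfidf(counts, doc_freq, num_images):
        # Term frequency weighted by log inverse document frequency over images.
        total = sum(counts.values()) or 1
        return {g: (c / total) * log(num_images / max(1.0, doc_freq[g]))
                for g, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(g, 0.0) for g, w in u.items())
        norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    def cider(candidate, refs, all_refs, max_n=4):
        # candidate: token list; refs: reference token lists for the same image;
        # all_refs: reference lists of every image in the corpus (used for IDF).
        num_images = len(all_refs)
        score = 0.0
        for n in range(1, max_n + 1):
            doc_freq = Counter()
            for image_refs in all_refs:
                doc_freq.update(set().union(*(ngrams(r, n) for r in image_refs)))
            cand_vec = tfidf(ngrams(candidate, n), doc_freq, num_images)
            ref_vecs = [tfidf(ngrams(r, n), doc_freq, num_images) for r in refs]
            score += sum(cosine(cand_vec, rv) for rv in ref_vecs) / len(refs)
        return score / max_n

N-grams that appear in the references of most images receive low IDF weight, so the score rewards agreement with the references on distinctive phrasing rather than on generic words.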
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
TLDR
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. (A minimal sketch of the Grad-CAM computation appears after this entry.)
  • 2,679
  • 456
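The gradient-based localization named in the title works by back-propagating the score of a target class to the feature maps of a convolutional layer, global-average-pooling those gradients into per-channel weights, and applying a ReLU to the weighted sum of the feature maps. Below is a minimal PyTorch sketch of that recipe, assuming a torchvision ResNet-50 with its layer4 block as the target layer; the model choice, hooks, and function names are illustrative assumptions, not the authors' released code.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet50().eval()   # untrained here; load pretrained weights for meaningful maps
    activations, gradients = {}, {}

    def save_activation(module, inputs, output):
        activations["feat"] = output
        # Capture the gradient flowing back into this activation during backward().
        output.register_hook(lambda grad: gradients.update(feat=grad))

    # Hook the last convolutional stage (any convolutional layer can be used).
    model.layer4.register_forward_hook(save_activation)

    def grad_cam(image, class_idx):
        # image: (1, 3, H, W) float tensor; returns an (H, W) heatmap in [0, 1].
        model.zero_grad()
        scores = model(image)
        scores[0, class_idx].backward()
        feats = activations["feat"]                      # feature maps, shape (1, K, h, w)
        grads = gradients["feat"]                        # gradient of the class score w.r.t. them
        weights = grads.mean(dim=(2, 3), keepdim=True)   # global-average-pooled channel weights
        cam = F.relu((weights * feats).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam[0, 0]

    heatmap = grad_cam(torch.randn(1, 3, 224, 224), class_idx=243)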
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
We balance the popular VQA dataset (Antol et al., ICCV 2015) by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but with a pair of similar images that lead to two different answers to the question.
  • 783
  • 177
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
TLDR
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent.
  • 1,451
  • 145
Hierarchical Question-Image Co-Attention for Visual Question Answering
TLDR
We present a novel co-attention model for VQA that jointly reasons about image and question attention and improves the state of the art on the VQA dataset.
  • 894
  • 120
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
TLDR
We introduce an adaptive attention encoder-decoder framework which can automatically decide when to rely on visual signals and when to just rely on the language model.
  • 755
  • 105
Relative attributes
Human-nameable visual “attributes” can benefit various recognition tasks. However, existing techniques restrict these properties to categorical labels (for example, a person is ‘smiling’ or not).
  • 837
  • 104
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
TLDR
We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. (A minimal sketch of such a co-attentional layer appears after this entry.)
  • 464
  • 104
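The co-attentional transformer layer mentioned in the TLDR exchanges information between the two streams by letting each stream's queries attend over the other stream's keys and values. The block below is a minimal PyTorch sketch of that exchange; the hidden size, head count, and class and variable names are illustrative assumptions, and the per-stream feed-forward sublayers of the full architecture are omitted.

    import torch
    import torch.nn as nn

    class CoAttentionBlock(nn.Module):
        def __init__(self, dim=768, heads=8):
            super().__init__()
            # Text queries attend over image keys/values, and vice versa.
            self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm_txt = nn.LayerNorm(dim)
            self.norm_img = nn.LayerNorm(dim)

        def forward(self, text, image):
            # text: (B, T, dim) token features; image: (B, R, dim) region features.
            t, _ = self.txt_attends_img(query=text, key=image, value=image)
            v, _ = self.img_attends_txt(query=image, key=text, value=text)
            return self.norm_txt(text + t), self.norm_img(image + v)

    # Example: 20 word tokens exchanging information with 36 image regions.
    text, image = torch.randn(2, 20, 768), torch.randn(2, 36, 768)
    out_text, out_image = CoAttentionBlock()(text, image)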
Visual Dialog
TLDR
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content.
  • 366
  • 63