Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
TLDR
This work proposes a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable, and shows that even non-attention-based models learn to localize discriminative regions of the input image.
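The Grad-CAM computation named in the title is compact: take the gradient of a class score with respect to a convolutional layer's feature maps, global-average-pool it into per-channel weights, and apply a ReLU to the weighted sum of the maps. Below is a minimal PyTorch sketch under stated assumptions: a torchvision ResNet-50 with `layer4` as the target layer, both illustrative choices rather than the paper's only configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()  # untrained weights so the sketch runs offline
store = {}

def hook(module, inputs, output):
    store["A"] = output.detach()                          # feature maps A^k
    output.register_hook(lambda g: store.update(dYdA=g))  # gradients d y^c / d A^k

model.layer4.register_forward_hook(hook)  # last conv stage: an illustrative choice

def grad_cam(x, class_idx=None):
    scores = model(x)
    if class_idx is None:
        class_idx = int(scores.argmax(dim=1))  # default to the predicted class
    model.zero_grad()
    scores[0, class_idx].backward()
    alpha = store["dYdA"].mean(dim=(2, 3), keepdim=True)         # GAP the gradients
    cam = F.relu((alpha * store["A"]).sum(dim=1, keepdim=True))  # ReLU(sum_k alpha_k A^k)
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalize to [0, 1]

heatmap = grad_cam(torch.randn(1, 3, 224, 224))  # dummy input, for illustration only
```

Because the heatmap is built from a convolutional layer's activations and gradients alone, the same function applies to classification, captioning, or VQA backbones; only the score being differentiated changes.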
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
TLDR
This work combines existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and applies it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures.
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
TLDR
This work poses a cooperative ‘image guessing’ game between two agents who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images, and shows the emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision.
Embodied Question Answering
TLDR
A new AI task where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'), and the agent must first intelligently navigate to explore the environment, gather necessary visual information through first-person (egocentric) vision, and then answer the question.
Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?
TLDR
The VQA-HAT (Human ATtention) dataset is introduced, and attention maps generated by state-of-the-art VQA models are evaluated against human attention both qualitatively and quantitatively.
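A rank correlation between flattened attention maps, after resizing both to a common grid, is one way to make such a quantitative comparison concrete; the metric choice, names, and shapes below are illustrative assumptions rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def attention_rank_correlation(model_map, human_map):
    """Spearman rank correlation between two same-sized attention maps."""
    rho, _ = spearmanr(model_map.ravel(), human_map.ravel())
    return rho

model_map = np.random.rand(14, 14)  # e.g. a 14x14 CNN-grid attention (illustrative)
human_map = np.random.rand(14, 14)  # human attention resized to the same grid
print(attention_rank_correlation(model_map, human_map))
```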
Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization
TLDR
It is shown that Guided Grad-CAM helps untrained users successfully discern a "stronger" deep network from a "weaker" one even when both networks make identical predictions, and also exposes the somewhat surprising insight that common CNN + LSTM models can be good at localizing discriminative input image regions despite not being trained on grounded image-text pairs.
TarMAC: Targeted Multi-Agent Communication
TLDR
This work proposes a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially observable environments, and augments this with a multi-round communication approach.
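The targeting mechanism is a signature-based soft attention over messages: each agent broadcasts a message consisting of a signature (key) and a value, and each receiver weights incoming messages against the signatures with its own query, so who listens to whom is learned rather than hard-coded. A minimal sketch, assuming simple linear projections of per-agent hidden states (dimensions and names are illustrative):

```python
import torch
import torch.nn.functional as F

n_agents, d_hidden, d_key, d_value = 4, 32, 16, 16
hidden = torch.randn(n_agents, d_hidden)     # per-agent hidden states (assumed given)

W_sig = torch.nn.Linear(d_hidden, d_key)     # message signature (key), broadcast
W_val = torch.nn.Linear(d_hidden, d_value)   # message value, broadcast
W_qry = torch.nn.Linear(d_hidden, d_key)     # receiver-side query

keys, values, queries = W_sig(hidden), W_val(hidden), W_qry(hidden)

# Soft attention decides whom each agent attends to; multi-round communication
# repeats this aggregation with updated hidden states.
logits = queries @ keys.T / d_key ** 0.5     # (n_agents, n_agents) match scores
attn = F.softmax(logits, dim=-1)
aggregated = attn @ values                   # per-agent aggregated incoming message
```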
Visual Dialog
TLDR
The task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content, is introduced and the first ‘visual chatbot’ is demonstrated.
End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
TLDR
This paper introduces a new dataset of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new dataset, that generates responses in a dialog about a video.
Neural Modular Control for Embodied Question Answering
TLDR
This work learns policies for navigation over long planning horizons from language input, using imitation learning to warm-start the policies at each level of the hierarchy (dramatically increasing sample efficiency), followed by reinforcement learning.