Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
- Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra
- IEEE International Conference on Computer Vision
- 7 October 2016
This work proposes a technique for producing 'visual explanations' for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable, and shows that even non-attention-based models learn to localize discriminative regions of the input image.
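The core computation described above can be sketched in a few lines: Grad-CAM weights each convolutional feature map by the global-average-pooled gradient of the class score with respect to that map, sums the weighted maps, and applies a ReLU. A minimal numpy sketch with hypothetical random activations and gradients (not the authors' code, which operates on a live network):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from conv feature maps and class-score gradients.

    activations: (K, H, W) feature maps A^k of the last conv layer
    gradients:   (K, H, W) d(score)/dA^k for the target class
    """
    # alpha_k: global-average-pool the gradients over spatial positions
    weights = gradients.mean(axis=(1, 2))                              # (K,)
    # weighted combination of feature maps, ReLU to keep positive evidence
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # normalize to [0, 1] for overlaying on the input image
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example with random tensors standing in for real network outputs
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 7, 7))   # hypothetical conv activations
dA = rng.standard_normal((8, 7, 7))  # hypothetical gradients of the class score
heatmap = grad_cam(A, dA)
print(heatmap.shape)  # (7, 7)
```

In practice the (7, 7) heatmap is upsampled to the input resolution; "Guided Grad-CAM" in the paper further multiplies it with guided backpropagation for pixel-level detail.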
Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization
- Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra
- 2016
It is shown that Guided Grad-CAM helps untrained users successfully discern a "stronger" deep network from a "weaker" one even when both networks make identical predictions, and also exposes the somewhat surprising insight that common CNN + LSTM models can be good at localizing discriminative input image regions despite not being trained on grounded image-text pairs.
Embodied Question Answering
- Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra
- IEEE/CVF Conference on Computer Vision and…
- 30 November 2017
A new AI task where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'), and the agent must first intelligently navigate to explore the environment, gather necessary visual information through first-person (egocentric) vision, and then answer the question.
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
- Abhishek Das, Satwik Kottur, J. Moura, Stefan Lee, Dhruv Batra
- IEEE International Conference on Computer Vision
- 20 March 2017
This work poses a cooperative ‘image guessing’ game between two agents who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images and shows the emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision.
TarMAC: Targeted Multi-Agent Communication
- Abhishek Das, Théophile Gervet, Joelle Pineau
- International Conference on Machine Learning
- 27 September 2018
This work proposes a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and whom to address them to while performing cooperative tasks in partially-observable environments, and augment this with a multi-round communication approach.
Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?
- Abhishek Das, Harsh Agrawal, C. L. Zitnick, Devi Parikh, Dhruv Batra
- Conference on Empirical Methods in Natural…
- 11 June 2016
The VQA-HAT (Human ATtention) dataset is introduced and attention maps generated by state-of-the-art VQA models are evaluated against human attention both qualitatively and quantitatively.
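The quantitative comparison above amounts to a rank correlation between a model's attention map and the human attention map over the same spatial grid. A minimal numpy sketch of Spearman rank correlation on flattened maps, assuming no tied values (a hypothetical helper, not the paper's evaluation code):

```python
import numpy as np

def rank_correlation(map_a, map_b):
    """Spearman rank correlation between two flattened attention maps.

    Assumes no tied values: argsort-of-argsort yields each cell's rank.
    """
    a, b = map_a.ravel(), map_b.ravel()
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    # Pearson correlation of the ranks
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

# Identical maps correlate perfectly; an inverted map correlates negatively
m = np.arange(9.0).reshape(3, 3)
print(rank_correlation(m, m))   # 1.0
print(rank_correlation(m, -m))  # -1.0
```

A score near 1 means the model attends where humans do; near 0 means the maps are unrelated.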
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
- Vishvak S. Murahari, Dhruv Batra, Devi Parikh, Abhishek Das
- European Conference on Computer Vision
- 5 December 2019
This work adapts the recently proposed ViLBERT model for multi-turn visually-grounded conversations and finds that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG but hurts MRR, highlighting a trade-off between the two primary metrics.
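The NDCG metric at the center of that trade-off rewards rankings that place all relevant candidate answers high, discounted logarithmically by position. A minimal numpy sketch of NDCG over dense relevance annotations (a hypothetical helper for illustration, not the VisDial evaluation code):

```python
import numpy as np

def ndcg(relevances, ranking, k=None):
    """NDCG of a predicted ranking over candidate answers.

    relevances: ground-truth relevance per candidate (e.g. dense annotations)
    ranking:    candidate indices ordered best-first by the model
    """
    rel = np.asarray(relevances, dtype=float)
    k = k or len(ranking)
    # positions 1..k are discounted by 1/log2(position + 1)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (rel[np.asarray(ranking[:k])] * discounts).sum()
    # normalize by the best achievable DCG (candidates sorted by relevance)
    ideal = (np.sort(rel)[::-1][:k] * discounts).sum()
    return float(dcg / ideal) if ideal > 0 else 0.0

# A ranking that sorts candidates by true relevance scores 1.0 exactly
print(ndcg([0.0, 1.0, 0.5], ranking=[1, 2, 0]))  # 1.0
```

MRR, by contrast, credits only the rank of the single human-chosen answer, which is why optimizing for dense-relevance NDCG can hurt MRR as the paper reports.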
End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
- Chiori Hori, Huda AlAmri, Devi Parikh
- IEEE International Conference on Acoustics…
- 21 June 2018
This paper introduces a new data set of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new data set, that generates responses in a dialog about a video.
Neural Modular Control for Embodied Question Answering
- Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra
- Conference on Robot Learning
- 23 October 2018
This work learns policies for navigation over long planning horizons from language input, using imitation learning to warm-start the policy at each level of a hierarchy, which dramatically increases sample efficiency, followed by fine-tuning with reinforcement learning.
Embodied Question Answering in Photorealistic Environments With Point Cloud Perception
- Erik Wijmans, Samyak Datta, Dhruv Batra
- Computer Vision and Pattern Recognition
- 6 April 2019
It is found that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.
...