Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
TLDR
This work proposes a technique for producing 'visual explanations' for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable, and shows that even non-attention-based models learn to localize discriminative regions of the input image.
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
TLDR
This work combines existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and applies it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures.
Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization
TLDR
It is shown that Guided Grad-CAM helps untrained users successfully discern a "stronger" deep network from a "weaker" one even when both networks make identical predictions, and also exposes the somewhat surprising insight that common CNN + LSTM models can be good at localizing discriminative input image regions despite not being trained on grounded image-text pairs.
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
TLDR
Diverse Beam Search is proposed, an alternative to BS that decodes a list of diverse outputs by optimizing for a diversity-augmented objective and consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.
Grad-CAM: Why did you say that?
We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions -- or visual explanations.
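The Grad-CAM computation summarized above is well known: channel weights come from global-average-pooling the gradients of the class score with respect to a convolutional layer's feature maps, and the heatmap is a ReLU over the weighted sum of those maps. A minimal sketch (array names and shapes are illustrative, not the paper's code):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap for one class from one conv layer.

    feature_maps: (K, H, W) activations A^k of the chosen conv layer.
    gradients:    (K, H, W) gradients dy^c/dA^k of the class score y^c.
    """
    # alpha_k^c: global-average-pool the gradients over spatial locations
    weights = gradients.mean(axis=(1, 2))
    # weighted combination sum_k alpha_k^c * A^k over channels
    cam = np.tensordot(weights, feature_maps, axes=1)
    # ReLU: keep only regions with positive influence on the class score
    return np.maximum(cam, 0)
```

In practice the resulting (H, W) map is upsampled to the input resolution and overlaid on the image; element-wise multiplication with Guided Backprop maps yields the high-resolution Guided Grad-CAM visualization mentioned above.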
Diverse Beam Search for Improved Description of Complex Scenes
TLDR
Diverse Beam Search is proposed, a diversity-promoting alternative to BS for approximate inference that produces sequences that are significantly different from each other by incorporating diversity constraints within groups of candidate sequences during decoding; moreover, it achieves this with minimal computational or memory overhead.
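The core of the diversity-augmented objective can be sketched for a single decoding step: each beam group scores tokens by model log-probability minus a penalty for tokens already chosen by earlier groups at that step (a Hamming-diversity term). The function and its default lambda here are a toy illustration, not the paper's implementation:

```python
def diverse_step(logprobs, prev_group_tokens, lam=0.5):
    """Pick one token under a diversity-augmented objective.

    logprobs: dict mapping token -> model log-probability (toy stand-in
              for a neural sequence model's next-token distribution).
    prev_group_tokens: tokens already selected by earlier beam groups
                       at this time step.
    lam: diversity strength (hypothetical default).
    """
    def augmented(tok):
        # log p(tok) minus lam * (how many earlier groups picked tok)
        return logprobs[tok] - lam * prev_group_tokens.count(tok)
    return max(logprobs, key=augmented)
```

Because the penalty only depends on tokens other groups have already committed to, groups can be decoded sequentially within a step, which is why the overhead relative to standard beam search is small.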
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
TLDR
This work proposes a generic approach called Human Importance-aware Network Tuning (HINT), which effectively leverages human demonstrations to improve visual grounding and encourages deep networks to be sensitive to the same input regions as humans.
Counting Everyday Objects in Everyday Scenes
TLDR
This work builds dedicated models for counting designed to tackle the large variance in counts, appearances, and scales of objects found in natural scenes, inspired by the phenomenon of subitizing.
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
TLDR
A contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning, and provides a theoretical analysis of ALBEF from a mutual information maximization perspective.
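The image-text contrastive objective referenced here is an InfoNCE-style symmetric cross-entropy over matched pairs; a generic sketch follows (this is the standard form of such a loss, not ALBEF's exact implementation, and it omits the momentum-distilled soft targets the paper adds):

```python
import numpy as np

def contrastive_align_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over N matched pairs.

    img_emb, txt_emb: (N, D) L2-normalized embeddings; row i of each
    is a matched image-text pair. temperature is an assumed default.
    """
    sim = img_emb @ txt_emb.T / temperature  # (N, N) similarity logits

    def xent_diag(logits):
        # cross-entropy where the correct class for row i is column i
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average image-to-text and text-to-image directions
    return 0.5 * (xent_diag(sim) + xent_diag(sim.T))
```

Aligning the unimodal embeddings with this loss before cross-modal attention fusion is what gives the method its name.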
SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
TLDR
The extent to which consistency issues occur in VQA is quantified, and an approach called Sub-Question Importance-aware Network Tuning (SQuINT) is proposed, which encourages the model to attend to the same parts of the image when answering the reasoning question and the perception sub-questions.