Generating Visual Explanations

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell
Clearly explaining a rationale for a classification decision to an end user can be as important as the decision itself. Through a novel loss function based on sampling and reinforcement learning, the model learns to generate sentences that realize a global sentence property, such as class specificity. Results on the CUB dataset show that the model generates explanations that are not only consistent with an image but also more discriminative than descriptions produced by existing captioning methods.
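The sampling-and-reinforcement-learning loss above can be sketched in miniature: sample captions from the model, score each with a reward measuring the desired global property (e.g. class specificity), and weight each sample's log-probability by its reward, REINFORCE-style. This is a minimal numpy sketch under stated assumptions, not the authors' implementation; `toy_sampler` and `toy_reward` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_and_weight(sampler_fn, reward_fn, n_samples=4):
    """REINFORCE-style surrogate loss: weight each sampled sentence's
    log-probability by the reward (e.g. class specificity) it earns.
    Minimizing the mean pushes probability mass toward rewarded samples."""
    losses = []
    for _ in range(n_samples):
        sentence, log_p = sampler_fn()   # sample a caption and its log-prob
        r = reward_fn(sentence)          # global sentence-property score
        losses.append(-r * log_p)        # -reward * log-prob, averaged below
    return float(np.mean(losses))

# Hypothetical stand-in "model" over two fixed captions.
def toy_sampler():
    captions = ["red wing", "a bird"]
    probs = np.array([0.7, 0.3])
    i = rng.choice(2, p=probs)
    return captions[i], np.log(probs[i])

def toy_reward(sentence):
    # A class-discriminative phrase earns a higher reward.
    return 1.0 if "red" in sentence else 0.1

loss = sample_and_weight(toy_sampler, toy_reward)
```

In the paper the reward comes from a sentence classifier judging class specificity; here it is a trivial keyword check purely for illustration.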

Generating Post-Hoc Rationales of Deep Visual Classification Decisions

This work emphasizes the importance of producing an explanation for an observed action, which could be applied to a black-box decision agent, akin to what one human produces when asked to explain the actions of a second human.

Grounding Visual Explanations

A phrase-critic model is introduced to refine generated candidate explanations, augmented with flipped phrases, improving the quality of textual explanations for fine-grained classification decisions on the CUB dataset by mentioning phrases that are grounded in the image.

Why do These Match? Explaining the Behavior of Image Similarity Models

A method, Salient Attributes for Network Explanation, is introduced to explain image similarity models, where the model's output is a score measuring the similarity of two inputs rather than a classification score; it can also improve performance on the classic task of attribute recognition.

Grounding Visual Explanations (Extended Abstract)

A phrase-critic model is introduced to refine (re-score/re-rank) generated candidate explanations and employ a relative-attribute inspired ranking loss using "flipped" phrases as negative examples for training.
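The relative-attribute inspired ranking loss described above can be illustrated with a hinge over critic scores: a phrase grounded in the image should outscore its "flipped" (attribute-swapped) negative by a margin. A minimal sketch, assuming scalar critic scores; the numbers and function name are hypothetical.

```python
def phrase_critic_ranking_loss(score_pos, score_neg, margin=1.0):
    """Margin ranking loss: a grounded phrase should score higher than
    its flipped (mismatched) counterpart by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))

# Hypothetical critic scores for a phrase and its attribute-flipped negative.
pos = 2.3   # "red throat", grounded in the image
neg = 0.4   # "blue throat", flipped phrase used as a negative example
loss = phrase_critic_ranking_loss(pos, neg)
```

When the positive phrase already beats the flipped phrase by more than the margin, the loss is zero and no gradient flows, which is the standard behavior of ranking hinges.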

Interpretable Basis Decomposition for Visual Explanation

A new framework called Interpretable Basis Decomposition for providing visual explanations for classification networks is proposed, decomposing the neural activations of the input image into semantically interpretable components pre-trained from a large concept corpus.
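The decomposition idea above can be sketched as a least-squares problem: express a network's feature vector as a combination of interpretable concept vectors plus a residual, where the coefficients indicate each concept's contribution. This is a toy numpy sketch under stated assumptions, not the paper's exact procedure; the concept vectors are hypothetical.

```python
import numpy as np

def basis_decompose(feature, concept_basis):
    """Approximate a feature vector as a linear combination of interpretable
    concept vectors via least squares; coefficients measure each concept's
    contribution, and the residual is the unexplained part."""
    coeffs, *_ = np.linalg.lstsq(concept_basis.T, feature, rcond=None)
    residual = feature - concept_basis.T @ coeffs
    return coeffs, residual

# Toy example: 2 concept vectors in a 3-d feature space (hypothetical).
basis = np.array([[1.0, 0.0, 0.0],   # e.g. a "feathered" concept direction
                  [0.0, 1.0, 0.0]])  # e.g. a "striped" concept direction
feat = np.array([2.0, 3.0, 0.5])
coeffs, resid = basis_decompose(feat, basis)
```

Here the feature splits into 2 units of the first concept, 3 of the second, and a residual along the direction no concept covers.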

Attentive Explanations: Justifying Decisions and Pointing to the Evidence

This work proposes the PJ-X model, which can justify its decision with a sentence and point to the evidence by introspecting its decision and explanation process using an attention mechanism; it focuses on explaining human activities, which are traditionally more challenging to explain than object classifications.

Generating Coherent and Informative Descriptions for Groups of Visual Objects and Categories: A Simple Decoding Approach

This work proposes an inference mechanism that extends an instance-level captioning model to generate coherent and informative descriptions for groups of visual objects from the same or different categories, and test the model in the domain of bird descriptions.

Classifier Labels as Language Grounding for Explanations

A novel approach to generating explanations is presented: it first finds the important features that most affect the classification prediction, then uses a secondary detector, which can identify and label multiple parts of the features, to label only those important features.

Learning Deep Representations of Fine-Grained Visual Descriptions

This model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state-of-the-art for zero-shot classification on the Caltech-UCSD Birds 200-2011 dataset.

Generation and Comprehension of Unambiguous Object Descriptions

This work proposes a method that can generate an unambiguous description of a specific object or region in an image and which can also comprehend or interpret such an expression to infer which object is being described, and shows that this method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene.

From captions to visual concepts and back

This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
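The multiple-instance learning step above can be sketched with the standard noisy-OR pooling: the image is labeled positive for a word if any region detector fires, so per-region probabilities combine as the complement of the product of their complements. A minimal numpy sketch, with hypothetical probabilities; the paper's detectors operate on CNN region features.

```python
import numpy as np

def noisy_or(region_probs):
    """Multiple-instance pooling: an image is positive for a word if ANY
    region depicts it. Noisy-OR combines per-region word probabilities
    into one image-level probability."""
    region_probs = np.asarray(region_probs)
    return 1.0 - np.prod(1.0 - region_probs)

# Hypothetical per-region probabilities that the word "bird" is depicted.
p_image = noisy_or([0.1, 0.05, 0.8])
```

One confident region (0.8) dominates the pooled score, which is exactly the behavior wanted when a caption word corresponds to a small part of the image.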

Learning Like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

Using linguistic context and visual features, the method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.
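The deterministic ("soft") attention variant above reduces to a softmax-weighted average of spatial features: score each region feature against the decoder state, normalize the scores, and return the weighted context vector. A toy numpy sketch under stated assumptions (dot-product scoring instead of the paper's learned MLP; numbers are hypothetical).

```python
import numpy as np

def soft_attention(features, query):
    """Soft attention: score each region feature against the decoder state
    (query), softmax into weights, and return the weighted context vector."""
    scores = features @ query                     # one score per region
    weights = np.exp(scores - scores.max())       # stable softmax
    weights /= weights.sum()
    context = weights @ features                  # convex combination of regions
    return context, weights

# 3 region features of dimension 2, and a decoder-state query (toy values).
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
query = np.array([2.0, 0.0])
ctx, w = soft_attention(feats, query)
```

Because everything here is differentiable, this path trains with standard backpropagation; the stochastic ("hard") variant instead samples one region and is trained by maximizing a variational lower bound.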

Deep Visual-Semantic Alignments for Generating Image Descriptions

A. Karpathy and L. Fei-Fei. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
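The structured alignment objective above can be sketched as a bidirectional ranking hinge over an image-sentence similarity matrix: each image should be more similar to its own sentence than to any other, and vice versa. A minimal numpy sketch, not the paper's exact formulation; the similarity values are hypothetical.

```python
import numpy as np

def alignment_rank_loss(sim, margin=1.0):
    """Bidirectional ranking objective over a similarity matrix `sim`
    (rows: images, columns: sentences). Each matched pair sim[i, i]
    should beat every mismatched pair by at least `margin`."""
    n = sim.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                loss += max(0.0, margin - sim[i, i] + sim[i, j])  # image -> sentences
                loss += max(0.0, margin - sim[i, i] + sim[j, i])  # sentence -> images
    return loss

# Toy similarity matrix for 2 image-sentence pairs.
S = np.array([[3.0, 0.5],
              [0.2, 2.5]])
loss = alignment_rank_loss(S)
```

With well-separated matched pairs, as here, the hinge is inactive and the loss is zero; in the paper the similarities come from a multimodal embedding of CNN region features and bidirectional RNN sentence fragments.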

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

DeCAF, an open-source implementation of deep convolutional activation features, along with all associated network parameters, are released to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.

Baby talk: Understanding and generating simple image descriptions

A system to automatically generate natural language descriptions from images, exploiting both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision; it is very effective at producing relevant sentences for images.

Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data

The Deep Compositional Captioner (DCC) is proposed to address the task of generating descriptions of novel objects which are not present in paired image-sentence datasets by leveraging large object recognition datasets and external text corpora and by transferring knowledge between semantically similar concepts.

Justification Narratives for Individual Classifications

This paper introduces the idea of a justification narrative: a simple model-agnostic mapping of the essential values underlying a classification to a semantic space and presents a package that automatically produces these narratives and realizes them visually or textually.