Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

  Ana Marasović, Chandra Bhagavatula, Jae Sung Park, Ronan Le Bras, Noah A. Smith, Yejin Choi
Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is… 

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

CALeC, a unified Chunk-aware Alignment and Lexical Constraint based method, significantly outperforms other competitor models in inference accuracy and in the quality of generated explanations.

Rationale-Inspired Natural Language Explanations with Commonsense

This paper introduces a self-rationalizing framework, called REXC, that extracts rationales as the features most responsible for the predictions, expands the extractive rationales using commonsense resources, and selects the best-suited commonsense knowledge to generate NLEs and give the final prediction.

ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning

This work presents ExplaGraphs, a new generative and structured commonsense-reasoning task (and an associated dataset) of explanation graph generation for stance prediction, and proposes a multi-level evaluation framework that checks for the structural and semantic correctness of the generated graphs and their degree of match with ground-truth graphs.

REX: Reasoning-aware and Grounded Explanation

  • Shi Chen, Qi Zhao
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
A new type of multi-modal explanation is defined that explains decisions by progressively traversing the reasoning process and grounding keywords in the images, and a novel explanation generation method is proposed that explicitly models the pairwise correspondence between words and regions of interest.

CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations

The large-scale CLEVR-X dataset is introduced that extends the CLEVR dataset with natural language explanations and a user study is conducted to confirm that the ground-truth explanations in the proposed dataset are indeed complete and relevant.

Knowledge-Grounded Self-Rationalization via Extractive and Natural Language Explanations

This work introduces REXC, a self-rationalizing framework that grounds its predictions and two complementary types of explanations (NLEs and extractive rationales) in background knowledge, and improves over previous methods by reaching SOTA task performance while also providing explanations.

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

It is shown that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve self-rationalization on tasks with multimodal inputs, and it is observed that no model type works universally best across tasks/datasets and data sizes.

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

NLX-GPT is introduced, a general, compact, and faithful language model that can simultaneously predict an answer and explain it, and is much faster than the current SoA.

Few-Shot Self-Rationalization with Natural Language Prompts

This work presents FEB, a standardized collection of four existing English-language datasets and associated metrics; it identifies the right prompting approach by extensively exploring natural language prompts on FEB, and demonstrates that progress on few-shot self-rationalization is possible.

Commonsense Knowledge Reasoning and Generation with Pre-trained Language Models: A Survey

A survey is presented of commonsense knowledge acquisition and reasoning tasks and of the strengths and weaknesses of state-of-the-art pre-trained models for commonsense reasoning and generation as revealed by these tasks, along with reflections on future research directions.

Visual Entailment: A Novel Task for Fine-Grained Image Understanding

A new inference task, Visual Entailment (VE), is introduced, consisting of image-sentence pairs in which the premise is defined by an image rather than by a natural language sentence as in traditional Textual Entailment tasks.

Grounded Textual Entailment

This paper argues for a visually-grounded version of the Textual Entailment task, and asks whether models can perform better if, in addition to P and H, there is also an image (corresponding to the relevant “world” or “situation”).

Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image

This work proposes VisualComet, a novel framework of visual commonsense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present; it establishes strong baseline performances on this task and demonstrates that integration between visual and textual commonsense reasoning is key and wins over non-integrative alternatives.

Explain Yourself! Leveraging Language Models for Commonsense Reasoning

This work collects human explanations for commonsense reasoning, in the form of natural language sequences and highlighted annotations, in a new dataset called Common Sense Explanations, and uses it to train language models to automatically generate explanations that can be used during training and inference in a novel Commonsense Auto-Generated Explanation framework.

From Recognition to Cognition: Visual Commonsense Reasoning

To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.

Grounding Visual Explanations

A phrase-critic model is proposed that refines generated candidate explanations, augmented with flipped phrases, to improve the textual explanation quality of fine-grained classification decisions on the CUB dataset by mentioning phrases that are grounded in the image.

e-SNLI: Natural Language Inference with Natural Language Explanations

The Stanford Natural Language Inference dataset is extended with an additional layer of human-annotated natural language explanations of the entailment relations, which can be used for various goals, such as obtaining full sentence justifications of a model’s decisions, improving universal sentence representations and transferring to out-of-domain NLI datasets.

Unsupervised Commonsense Question Answering with Self-Talk

An unsupervised framework based on self-talk, inspired by inquiry-based discovery learning, is presented as a novel alternative for multiple-choice commonsense tasks; it improves performance on several benchmarks and competes with models that obtain knowledge from external KBs.

NILE: Natural Language Inference with Faithful Natural Language Explanations

This work proposes Natural-language Inference over Label-specific Explanations (NILE), a novel NLI method that utilizes auto-generated label-specific NL explanations to produce labels along with their faithful explanations, and demonstrates NILE's effectiveness over previously reported methods through automated and human evaluation of the produced labels and explanations.

Faithful Multimodal Explanation for Visual Question Answering

This paper presents a novel approach to developing a high-performing VQA system that can elucidate its answers with integrated textual and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations.