From Recognition to Cognition: Visual Commonsense Reasoning

Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Visual understanding goes well beyond object recognition. We introduce a new dataset, VCR, consisting of 290k multiple-choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple-choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%).
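The Adversarial Matching step described above can be viewed as an assignment problem: each question's correct answer is recycled as a distractor for another question, trading off how plausible it looks for the new question against how close it is to that question's own answer. A minimal sketch (the `lam` weight and toy scores are assumptions; the paper uses learned relevance and similarity models):

```python
# Illustrative sketch (not the authors' code) of Adversarial Matching:
# recycle each question's correct answer as a distractor for another
# question, maximizing relevance to the new question while penalizing
# similarity to that question's own answer.
import numpy as np
from scipy.optimize import linear_sum_assignment

def adversarial_match(relevance, similarity, lam=0.5):
    """relevance[i, j]: how plausible answer j looks for question i.
    similarity[i, j]: how close answer j is to question i's true answer.
    Returns {question index -> assigned distractor index}."""
    score = relevance - lam * similarity
    np.fill_diagonal(score, -1e9)  # an answer cannot be its own distractor
    rows, cols = linear_sum_assignment(-score)  # maximize total score
    return dict(zip(rows.tolist(), cols.tolist()))
```

Because the assignment is one-to-one, every distractor is plausible for its new question while staying distinguishable from the correct choice, which is what keeps the resulting multiple-choice problems low-bias.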

Attention Mechanism based Cognition-level Scene Understanding

A parallel attention-based cognitive VCR network (PAVCR) is proposed that efficiently fuses visual and textual information and encodes semantic information in parallel, enabling the model to capture rich cues for cognition-level inference.

Multi-Level Knowledge Injecting for Visual Commonsense Reasoning

  • Zhang Wen, Yuxin Peng
  • Computer Science
    IEEE Transactions on Circuits and Systems for Video Technology
  • 2021
This work proposes the Commonsense Knowledge based Reasoning Model (CKRM), a knowledge-based reasoning approach that relates transferred knowledge to visual content and composes reasoning cues to derive the final answer on the challenging visual commonsense reasoning dataset VCR.

VIPHY: Probing "Visible" Physical Commonsense Knowledge

This work evaluates VLMs' ability to acquire "visible" physical knowledge – the information that is easily accessible from images of static scenes, particularly across the dimensions of object color, size, and space – and indicates a severe gap between model and human performance across all three tasks.

A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge

This work introduces A-OKVQA, a crowdsourced dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer, and demonstrates the potential of this new dataset through a detailed analysis of its contents and baseline performance measurements over a variety of state-of-the-art vision–language models.

Multi-Level Counterfactual Contrast for Visual Commonsense Reasoning

This paper proposes a novel multi-level counterfactual contrastive learning network for VCR by jointly modeling the hierarchical visual contents and the inter-modality relationships between the visual and linguistic domains and incorporates an auxiliary contrast module to directly optimize the answer prediction in VCR.

Joint Answering and Explanation for Visual Commonsense Reasoning

A plug-and-play knowledge-distillation-enhanced framework is proposed to couple the question answering and rationale inference processes; it is model-agnostic, applicable to existing popular baselines, and validated on the benchmark dataset.

Enforcing Reasoning in Visual Commonsense Reasoning

This paper proposes an end-to-end trainable model which considers both answers and their reasons jointly, and demonstrates through experiments that the model performs competitively against the current state of the art.

CRIC: A VQA Dataset for Compositional Reasoning on Vision and Commonsense.

This work proposes a VQA benchmark, CRIC, which introduces new types of questions about Compositional Reasoning on vIsion and Commonsense, along with an evaluation metric integrating answer correctness and commonsense grounding, and an automatic algorithm to generate question samples from the scene graph associated with the images and the relevant knowledge graph.

Weakly Supervised Relative Spatial Reasoning for Visual Question Answering

Two objectives are designed as proxies for 3D spatial reasoning (SR) – object centroid estimation and relative position estimation – and V&L models are trained with weak supervision from off-the-shelf depth estimators, leading to considerable improvements in accuracy on the GQA visual question answering challenge as well as in relative spatial reasoning.

Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"

This work introduces a differentiable first-order logic formalism for VQA that explicitly decouples question answering from visual perception and proposes a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.

Inferring the Why in Images

This work uses recently developed natural language models to mine knowledge stored in massive amounts of text; the results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image.

Visual7W: Grounded Question Answering in Images

A semantic link between textual descriptions and image regions is established by object-level grounding, enabling a new type of QA with visual answers in addition to the textual answers used in previous work, and a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.

Revisiting Visual Question Answering Baselines

The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.

Learning to Act Properly: Predicting and Explaining Affordances from Images

This work proposes a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object, and collects a new dataset that builds upon ADE20k, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with its new tasks.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

This work introduces a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed, and poses training as an adversarial game between the VQA model and this question-only adversary – discouraging the VQA model from capturing language bias in its question encoding.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.
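The Adversarial Filtering loop summarized here – retrain a classifier, discard the negatives it separates easily, repeat – can be sketched minimally as follows (the `fit` callback is an assumption standing in for the paper's ensemble of stylistic classifiers):

```python
# Toy sketch of the Adversarial Filtering (AF) loop, not SWAG's actual code:
# retrain a classifier on the current negatives, then keep only the
# candidates it finds hardest to tell apart from the positives.
def adversarial_filter(positives, candidates, fit, n_keep=3, n_rounds=3):
    """fit(positives, negatives) -> score(x), an estimate of P(x is positive).
    Each round re-ranks all candidate negatives by how positive they look
    to a classifier trained against the current negatives."""
    negatives = candidates[:n_keep]
    for _ in range(n_rounds):
        score = fit(positives, negatives)
        # highest-scoring candidates fool the classifier -> hardest negatives
        negatives = sorted(candidates, key=score, reverse=True)[:n_keep]
    return negatives
```

Each iteration raises the bar: negatives that the classifier can separate on surface style alone are swapped out, which is what de-biases the resulting dataset.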

Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions

This paper proposes a hybrid model which integrates a physics engine into a question answering architecture in order to anticipate future scene states resulting from object-object interactions caused by an action.