From Recognition to Cognition: Visual Commonsense Reasoning

@article{Zellers2019FromRT,
  title={From Recognition to Cognition: Visual Commonsense Reasoning},
  author={Rowan Zellers and Yonatan Bisk and Ali Farhadi and Yejin Choi},
  journal={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019},
  pages={6713-6724}
}
Visual understanding goes well beyond object recognition. [...] Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45…
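The Adversarial Matching step mentioned in the abstract can be pictured as a maximum-weight bipartite matching: correct answers to other questions are recycled as distractors, and the matching weight trades relevance to the query against similarity to the gold answer, so distractors are plausible but not accidentally correct. The sketch below illustrates only that idea; the relevance and similarity scorers, the trade-off weight lam, and all names are placeholders rather than the paper's models.

```python
# Minimal sketch of the Adversarial Matching idea, assuming precomputed
# relevance[i, j] (how plausible answer j looks for question i) and
# similarity[i, j] (how close answer j is to question i's gold answer).
# Scorers and the trade-off weight are assumptions, not the paper's models.
import numpy as np
from scipy.optimize import linear_sum_assignment

def adversarial_matching(relevance, similarity, lam=0.5):
    """Assign one distractor (a recycled gold answer) to each question."""
    weight = relevance - lam * similarity   # plausible, yet far from the gold answer
    np.fill_diagonal(weight, -1e9)          # an answer cannot distract its own question
    rows, cols = linear_sum_assignment(weight, maximize=True)
    return dict(zip(rows.tolist(), cols.tolist()))

# Toy usage with random scores for five questions / candidate answers.
rng = np.random.default_rng(0)
print(adversarial_matching(rng.random((5, 5)), rng.random((5, 5))))
```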
Multi-Level Knowledge Injecting for Visual Commonsense Reasoning
  • Zhang Wen, Yuxin Peng
  • Computer Science
  • IEEE Transactions on Circuits and Systems for Video Technology
  • 2021
TLDR
This work proposes the Commonsense Knowledge based Reasoning Model (CKRM), a knowledge-based reasoning approach that relates transferred knowledge to visual content and composes reasoning cues to derive the final answer on the challenging visual commonsense reasoning dataset VCR.
Enforcing Reasoning in Visual Commonsense Reasoning
TLDR
This paper proposes an end-to-end trainable model which considers both answers and their reasons jointly, and demonstrates through experiments that the model performs competitively against the current state of the art.
Weakly Supervised Relative Spatial Reasoning for Visual Question Answering
Vision-and-language (V&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between…
Project on Visual Commonsense Reasoning Anonymous ACL submission
  • 2019
Humans understand the world by recognizing objects in context and reasoning about their relationships. Trying to mimic the way the human brain recognizes objects in a visual scene, …
Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory
  • Xuejiao Tang, Xin Huang, +4 authors Ji Zhang
  • Computer Science
  • DaWaK
  • 2021
TLDR
A dynamic working memory based cognitive VCR network is proposed, which stores accumulated commonsense between sentences to provide prior knowledge for inference and offers an intuitive interpretation of visual commonsense reasoning.
A Simple Baseline for Visual Commonsense Reasoning
TLDR
It is shown that a much simpler model can perform better with half the number of trainable parameters; associating visual features with attribute information and improving text-to-image grounding yields further gains for this simple yet effective baseline, TAB-VCR.
Computer vision beyond the visible: image understanding through language
TLDR
This thesis presents contributions to the problem of image-to-set prediction, understood as the task of predicting a variable-sized collection of unordered elements for an input image, and conducts a thorough analysis of current methods for multi-label image classification, which are able to solve the task in an end-to-end manner.
Connective Cognition Network for Directional Visual Commonsense Reasoning
TLDR
This work proposes a connective cognition network (CCN) to dynamically reorganize the visual neuron connectivity that is contextualized by the meaning of questions and answers, and proposes directional connectivity to infer answers or rationales.
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
TLDR
This work incorporates commonsense knowledge into the cross-modal BERT and proposes a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short), which outperforms other task-specific models and general task-agnostic pre-training models by a large margin.
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines
TLDR
A much simpler model, obtained by ablating and pruning the existing intricate baseline, can perform better with half the number of trainable parameters; the resulting baseline for the visual commonsense reasoning (VCR) task is TAB-VCR.

References

Showing 1-10 of 98 references
Inferring the Why in Images
TLDR
The results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image, using recently developed natural language models to mine knowledge stored in massive amounts of text.
Visual7W: Grounded Question Answering in Images
TLDR
Object-level grounding creates a semantic link between textual descriptions and image regions, enabling a new type of QA with visual answers in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
Revisiting Visual Question Answering Baselines
TLDR
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
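The binary-classification alternative mentioned here can be read as scoring each (image, question, candidate answer) triple and selecting the highest-scoring candidate. The sketch below is only an illustration of that formulation under assumed feature dimensions and a generic fusion MLP; it is not the paper's architecture.

```python
# Hypothetical scorer for (image, question, answer) triples; dimensions,
# fusion scheme, and names are illustrative assumptions.
import torch
import torch.nn as nn

class TripleScorer(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + 2 * txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # single logit: "is this answer correct?"

    def forward(self, img, question, answer):
        return self.mlp(torch.cat([img, question, answer], dim=-1)).squeeze(-1)

# Score four candidate answers for one image/question pair and pick the best.
scorer = TripleScorer()
img = torch.randn(4, 2048)     # image features, repeated per candidate
q = torch.randn(4, 300)        # question embedding, repeated per candidate
cands = torch.randn(4, 300)    # candidate answer embeddings
best = scorer(img, q, cands).argmax().item()
```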
Learning to Act Properly: Predicting and Explaining Affordances from Images
TLDR
This work proposes a model that exploits Graph Neural Networks to propagate contextual information from the scene in order to perform detailed affordance reasoning about each object, and collects a new dataset that builds upon ADE20k, referred to as ADE-Affordance, which contains annotations enabling such rich visual reasoning.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
TLDR
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with its new tasks.
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
TLDR
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization
TLDR
This work introduces a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed, and poses training as an adversarial game between the VQA model and this question-only adversary, discouraging the VQA model from capturing language bias in its question encoding.
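The adversarial game described above is commonly realized with a gradient-reversal layer: a question-only classifier learns to predict the answer from the question encoding alone, while the reversed gradients push the question encoder to shed that bias. The sketch below shows that generic mechanism with invented module names and dimensions; it is not the paper's code.

```python
# Generic gradient-reversal sketch of a question-only adversary; names,
# dimensions, and the answer vocabulary size are assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Flip the gradient so the encoder is pushed away from question-only bias.
        return -ctx.lam * grad_out, None

class QuestionOnlyAdversary(nn.Module):
    def __init__(self, q_dim=512, n_answers=3000, lam=1.0):
        super().__init__()
        self.lam = lam
        self.clf = nn.Linear(q_dim, n_answers)

    def forward(self, q_encoding):
        return self.clf(GradReverse.apply(q_encoding, self.lam))

# In a training step, q_enc would come from the VQA model's question encoder.
q_enc = torch.randn(8, 512, requires_grad=True)
labels = torch.randint(0, 3000, (8,))
adv_loss = nn.functional.cross_entropy(QuestionOnlyAdversary()(q_enc), labels)
adv_loss.backward()  # q_enc.grad now carries the reversed adversarial signal
```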
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
TLDR
This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers and using them to filter the data.
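Adversarial Filtering, as summarized here, repeatedly retrains a classifier to separate real endings from machine-written candidates and keeps only the negatives the classifier still mistakes for real text. The loop below is a drastically simplified, single-classifier sketch (no per-instance candidate sets, no held-out splits, no stylistic-feature ensemble); every function, feature, and threshold is an assumption.

```python
# Drastically simplified Adversarial Filtering loop over precomputed features;
# the actual procedure uses per-instance candidate pools, held-out splits, and
# an ensemble of stylistic classifiers. Everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_filtering(pos, neg_pool, n_keep, n_rounds=10, seed=0):
    """pos: (n, d) features of real endings; neg_pool: (m, d) features of
    machine-written candidates. Returns indices of the hardest negatives."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(neg_pool), size=n_keep, replace=False)
    for _ in range(n_rounds):
        X = np.vstack([pos, neg_pool[keep]])
        y = np.r_[np.ones(len(pos)), np.zeros(n_keep)]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        # Keep the candidates the classifier most readily mistakes for real
        # endings, so surviving negatives are hard and less stylistically biased.
        p_real = clf.predict_proba(neg_pool)[:, 1]
        keep = np.argsort(-p_real)[:n_keep]
    return keep

# Toy usage on random features: 200 real endings, 1000 candidates, keep 200.
rng = np.random.default_rng(0)
print(adversarial_filtering(rng.normal(size=(200, 16)),
                            rng.normal(size=(1000, 16)), n_keep=200)[:10])
```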