Corpus ID: 229340515

Object-Centric Diagnosis of Visual Reasoning

Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David Cox, Joshua B. Tenenbaum, Chuang Gan
Answering questions about an image requires not only knowing what – understanding the fine-grained contents of the image (e.g., objects, relationships) – but also telling why – reasoning over grounded visual cues to derive the answer to a question. Over the last few years, we have seen significant progress on visual question answering. Impressive as the accuracy gains are, it remains unclear whether these models are performing grounded visual reasoning or just…


From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features and extensive experimental results show that the proposed RSVQA framework can achieve promising performance.
Object-Centric Representation Learning with Generative Spatial-Temporal Factorization
DyMON learns—without supervision—to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations, and constructs scene object spatial representations suitable for rendering at arbitrary times and from arbitrary viewpoints.
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
This paper proposes MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question, and shows that the pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances.
ProTo: Program-Guided Transformer for Program-Guided Tasks
It is demonstrated that ProTo outperforms the previous state-of-the-art methods on GQA visual reasoning and 2D Minecraft policy learning datasets and demonstrates better generalization to unseen, complex, and human-written programs.
KANDINSKYPatterns - An experimental exploration environment for Pattern Analysis and Machine Intelligence
This paper discusses existing diagnostic tests and test datasets such as CLEVR, CLEVRER, CLOSURE, CURI, Bongard-LOGO, and V-PROM, and presents the KANDINSKYPatterns, named after the Russian artist Wassily Kandinsky, which have computationally controllable properties and are easily distinguishable by human observers.
Data Efficient Masked Language Modeling for Vision and Language
This paper investigates a range of alternative masking strategies specific to the cross-modal setting that address shortcomings of MLM, aiming for better fusion of text and image in the learned representation.


Visual7W: Grounded Question Answering in Images
A semantic link between textual descriptions and image regions by object-level grounding enables a new type of QA with visual answers, in addition to textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.
CLEVR-Ref+: Diagnosing Visual Reasoning With Referring Expressions
Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current…
Yin and Yang: Balancing and Answering Binary Visual Questions
This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired in the questions by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
Revisiting Visual Question Answering Baselines
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
This work proposes a generic approach called Human Importance-aware Network Tuning (HINT), which effectively leverages human demonstrations to improve visual grounding and encourages deep networks to be sensitive to the same input regions as humans.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
A negative case analysis of visual grounding methods for VQA
It is found that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements, and a simpler regularization scheme is proposed that achieves near state-of-the-art performance on VQA-CPv2.
MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
This work presents MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input, to improve out-of-distribution (OOD) generalization on benchmarks such as the VQA-CP challenge.
Interpretable Counting for Visual Question Answering
The model sequentially selects from detected objects and learns interactions between objects that influence subsequent selections and outperforms the state of the art architecture for VQA on multiple metrics that evaluate counting.