CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

@article{Johnson2017CLEVRAD,
  title={CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning},
  author={Justin Johnson and Bharath Hariharan and Laurens van der Maaten and Li Fei-Fei and C. Lawrence Zitnick and Ross B. Girshick},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={1988-1997}
}
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. [...] It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
Evaluating Visual Reasoning through Grounded Language Understanding
TLDR
The Cornell Natural Language Visual Reasoning (NLVR) corpus, which targets reasoning skills like counting, comparisons, and set theory, is introduced; analysis confirms that NLVR presents diversity and complexity beyond what is provided by contemporary benchmarks.
Analyzing Compositionality in Visual Question Answering
TLDR
This paper analyzes the performance of LXMERT, one of the transformer models pretrained on large amounts of images and associated text, and shows that despite the model’s strong quantitative results, it may not be performing compositional reasoning: it does not need many relational cues to achieve this performance and, more generally, uses relatively little linguistic information.
A Bayesian approach to Visual Question Answering
Visual question answering (VQA) is a complex task involving perception and reasoning. Typical approaches, which use black-box neural networks, do well on perception but fail to generalize due to lack [...]
Object-Based Reasoning in VQA
TLDR
A solution combining state-of-the-art object detection and reasoning modules for VQA is presented and evaluated on the well-balanced CLEVR dataset, showing significant improvements of a few percent in accuracy on the complex "counting" task.
Evaluation of Multiple Approaches for Visual Question Reasoning
TLDR
This work investigates two end-to-end architectures augmented with a relational neural module on the challenging Cornell NLVR visual question answering task, achieving state-of-the-art performance that outperforms previously reported results on the same benchmark dataset.
SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning
TLDR
Experiments show that further pretraining LMs on these automatically generated data significantly improves LMs’ capability for spatial understanding, which in turn helps to better solve two external datasets, bAbI and boolQ.
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and [...]
Interpretable Visual Question Answering by Reasoning on Dependency Trees
TLDR
A novel neural network model that performs global reasoning on a dependency tree parsed from the question and is capable of building an interpretable visual question answering (VQA) system that gradually derives image cues following question-driven parse-tree reasoning.
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines
TLDR
A much simpler model obtained by ablating and pruning the existing intricate baseline can perform better with half the number of trainable parameters, and is obtained for the new visual commonsense reasoning (VCR) task, TAB-VCR.
CLEVR-Ref+: Diagnosing Visual Reasoning With Referring Expressions
Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current [...]

References

Showing 1-10 of 58 references
Dynamic Memory Networks for Visual and Textual Question Answering
TLDR
The new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting fact supervision.
A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input
TLDR
This work proposes a method for automatically answering questions about images, bringing together recent advances from natural language processing and computer vision in a multi-world approach that represents uncertainty about the perceived world in a Bayesian framework.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
TLDR
This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classifies these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.
Yin and Yang: Balancing and Answering Binary Visual Questions
TLDR
This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired about in the questions, by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.
Neural Module Networks
TLDR
A procedure is presented for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering, using these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).
Exploring Models and Data for Image Question Answering
TLDR
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
Towards a Visual Turing Challenge
TLDR
This paper discusses and exemplifies some of the solutions on a recently presented question-answering dataset based on real-world indoor images that establishes a visual Turing challenge, and argues that, despite the success of unique ground-truth annotation, the authors likely have to step away from carefully curated datasets and instead rely on "social consensus" as the main driving force to create suitable benchmarks.
Revisiting Visual Question Answering Baselines
TLDR
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose [...]
Learning to Compose Neural Networks for Question Answering
TLDR
A question answering model is presented that applies to both images and structured knowledge bases, using natural language strings to automatically assemble neural networks from a collection of composable modules, and achieving state-of-the-art results on benchmark datasets.