• Corpus ID: 208614779

CLOSURE: Assessing Systematic Generalization of CLEVR Models

Dzmitry Bahdanau, Harm de Vries, Timothy J. O'Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, Aaron C. Courville

The CLEVR dataset of natural-looking questions about 3D-rendered scenes has recently received much attention from the research community. A number of models have been proposed for this task, many of which achieve very high accuracies of around 97-99%. In this work, we study how systematic the generalization of such models is, that is, to what extent they are capable of handling novel combinations of known linguistic constructs. To this end, we test models' understanding of referring…

Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

This model first creates a multimodal graph, processes it with a graph neural network to induce a factor correspondence matrix, and then outputs a symbolic program to predict answers to questions.

Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering

This work proposes a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner using a CKY-style parser, and shows that this inductive bias towards tree structures dramatically improves systematic generalization to out-of-distribution examples.

Structurally Diverse Sampling Reduces Spurious Correlations in Semantic Parsing Datasets

This work proposes a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs that leads to better generalization and uses information theory to show that reduction in spurious correlations between substructures may be one reason why diverse training sets improve generalization.

A causal view of compositional zero-shot recognition

A causal-inspired embedding model that learns disentangled representations of elementary components of visual objects from correlated (confounded) training data is presented, and improvements compared to strong baselines are shown.

Unobserved Local Structures Make Compositional Generalization Hard

This work investigates the factors that make generalization to certain test instances challenging and proposes a criterion for the difficulty of an example: a test instance is hard if it contains a local structure that was not observed at training time.

But Should VQA expect Them To?

The GQA-OOD benchmark is proposed, which measures and compares accuracy over both rare and frequent question-answer pairs, and it is argued that the former is better suited to the evaluation of reasoning abilities.

A Benchmark for Systematic Generalization in Grounded Language Understanding

A new benchmark, gSCAN, is introduced for evaluating compositional generalization in models of situated language understanding, taking inspiration from standard models of meaning composition in formal linguistics and defining a language grounded in the states of a grid world.

Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding

It is argued that the inherent structured semantics of videos and language is the crucial factor in achieving compositional generalization, and a variational cross-graph reasoning framework is proposed that explicitly decomposes video and language into hierarchical semantic graphs and learns the semantic correspondence between the two graphs.

Improving Compositional Generalization in Semantic Parsing

This work analyzes a wide variety of models and proposes multiple extensions to the attention module of the semantic parser, aiming to improve compositional generalization in semantic parsing, as output programs are constructed from sub-components.

ReaSCAN: Compositional Reasoning in Language Grounding

This work proposes ReaSCAN, a benchmark dataset that builds off gSCAN but requires compositional language interpretation and reasoning about entities and relations, and assesses two models on ReaSCAN: a multi-modal baseline and a state-of-the-art graph convolutional neural model.



Systematic Generalization: What Is Required and Can It Be Learned?

The findings show that the generalization of modular models is much more systematic and that it is highly sensitive to the module layout, i.e. to how exactly the modules are connected, whereas systematic generalization in language understanding may require explicit regularizers or priors.

Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

A novel method to systematically construct compositional generalization benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets is introduced, and it is demonstrated how this method can be used to create new compositionality benchmarks on top of the existing SCAN dataset.

Analyzing the Behavior of Visual Question Answering Models

Today's VQA models are "myopic" (tend to fail on sufficiently novel instances), often "jump to conclusions" (converge on a predicted answer after 'listening' to just half the question), and are "stubborn" (do not change their answers across images).

C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset

This paper proposes a new setting for Visual Question Answering where the test question-answer pairs are compositionally novel compared to training question-answer pairs, and presents a new compositional split of the VQA v1.0 dataset, which is called Compositional VQA (C-VQA).

Learning Visual Reasoning Without Strong Priors

This work shows that a general-purpose, Conditional Batch Normalization approach achieves state-of-the-art results on the CLEVR Visual Reasoning benchmark with a 2.4% error rate, and probes the model to shed light on how it reasons, showing it has learned a question-dependent, multi-step process.

Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks

Systematic compositionality is the ability to recombine meaningful units with regular and predictable outcomes, and it’s seen as key to the human capacity for generalization in language. Recent work

Learning to Reason: End-to-End Module Networks for Visual Question Answering

End-to-End Module Networks are proposed, which learn to reason by directly predicting instance-specific network layouts without the aid of a parser, and achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches.

Neural Module Networks

This paper describes a procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).

Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks

This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.

Compositional Attention Networks for Machine Reasoning

The MAC network is presented, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning that is computationally-efficient and data-efficient, in particular requiring 5x less data than existing models to achieve strong results.