Corpus ID: 208614779

CLOSURE: Assessing Systematic Generalization of CLEVR Models

Dzmitry Bahdanau, Harm de Vries, Timothy J. O'Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, Aaron C. Courville
The CLEVR dataset of natural-looking questions about 3D-rendered scenes has recently received much attention from the research community. A number of models have been proposed for this task, many of which achieved very high accuracies of around 97-99%. In this work, we study how systematic the generalization of such models is, that is, to what extent they are capable of handling novel combinations of known linguistic constructs. To this end, we test models' understanding of referring…

Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules

This paper proposes a visual capsule module with a query-based selection mechanism over capsule features, which allows the model to focus on relevant regions based on textual cues about visual information in the question; integrating the proposed capsule module into existing VQA systems significantly improves their performance on the weakly supervised grounding task.

CURI: A Benchmark for Productive Concept Learning Under Uncertainty

A new few-shot meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI), is introduced, which defines a model-independent "compositionality gap" to evaluate the difficulty of generalizing out-of-distribution along several axes of compositional variation.

Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning

A virtual benchmark, Super-CLEVR, in which different factors in VQA domain shifts can be isolated so that their effects can be studied independently; the results suggest that disentangling reasoning and perception, combined with probabilistic uncertainty, forms a strong VQA model that is more robust to domain shifts.

Multimodal Graph Networks for Compositional Generalization in Visual Question Answering

This model first creates a multimodal graph, processes it with a graph neural network to induce a factor correspondence matrix, and then outputs a symbolic program to predict answers to questions.

Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering

This work proposes a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner using a CKY-style parser, and shows that this inductive bias towards tree structures dramatically improves systematic generalization to out-of-distribution examples.

Structurally Diverse Sampling Reduces Spurious Correlations in Semantic Parsing Datasets

This work proposes a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs, which leads to better generalization, and uses information theory to show that a reduction in spurious correlations between substructures may be one reason why diverse training sets improve generalization.

A causal view of compositional zero-shot recognition

A causal-inspired embedding model that learns disentangled representations of elementary components of visual objects from correlated (confounded) training data is presented, and improvements compared to strong baselines are shown.

But Should VQA Expect Them To?

The GQA-OOD benchmark is proposed, which evaluates accuracy on both rare and frequent question-answer pairs, and it is argued that the former is better suited to the evaluation of reasoning abilities.

A Benchmark for Systematic Generalization in Grounded Language Understanding

A new benchmark, gSCAN, is introduced for evaluating compositional generalization in models of situated language understanding, taking inspiration from standard models of meaning composition in formal linguistics and defining a language grounded in the states of a grid world.

Improving Compositional Generalization in Semantic Parsing

This work analyzes a wide variety of models and proposes multiple extensions to the attention module of the semantic parser, aiming to improve compositional generalization in semantic parsing, as output programs are constructed from sub-components.


Systematic Generalization: What Is Required and Can It Be Learned?

The findings show that the generalization of modular models is much more systematic and that it is highly sensitive to the module layout, that is, to how exactly the modules are connected; the results also suggest that systematic generalization in language understanding may require explicit regularizers or priors.

Measuring Compositional Generalization: A Comprehensive Method on Realistic Data

A novel method to systematically construct compositional generalization benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets is introduced, and it is demonstrated how this method can be used to create new compositionality benchmarks on top of the existing SCAN dataset.

Analyzing the Behavior of Visual Question Answering Models

Today's VQA models are "myopic" (tend to fail on sufficiently novel instances), often "jump to conclusions" (converge on a predicted answer after 'listening' to just half the question), and are "stubborn" (do not change their answers across images).

C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset

This paper proposes a new setting for Visual Question Answering in which the test question-answer pairs are compositionally novel compared to training question-answer pairs, and presents a new compositional split of the VQA v1.0 dataset, called Compositional VQA (C-VQA).

Learning Visual Reasoning Without Strong Priors

This work shows that a general-purpose, Conditional Batch Normalization approach achieves state-of-the-art results on the CLEVR Visual Reasoning benchmark with a 2.4% error rate, and probes the model to shed light on how it reasons, showing it has learned a question-dependent, multi-step process.

Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks

Systematic compositionality is the ability to recombine meaningful units with regular and predictable outcomes, and it is seen as key to the human capacity for generalization in language. Recent work…

Learning to Reason: End-to-End Module Networks for Visual Question Answering

End-to-End Module Networks are proposed, which learn to reason by directly predicting instance-specific network layouts without the aid of a parser, and achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches.

ShapeWorld - A new test methodology for multimodal language understanding

We introduce a novel framework for evaluating multimodal deep learning models with respect to their language understanding and generalization abilities. In this approach, artificial data is…

Neural Module Networks

A procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).

Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks

This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.