MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering

Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input, to improve OOD generalization, such as the VQA-CP challenge. Under this… 

Introspective Distillation for Robust Question Answering

This paper presents a novel debiasing method called Introspective Distillation (IntroD) to make the best of both worlds for QA by introspecting whether a training sample fits in the factual ID world or the counterfactual OOD one.

A Comprehensive Survey on Visual Question Answering Debias

Zaiwei Lu · Computer Science · 2022 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA) · 2022
This work groups existing VQA debiasing methods into three categories: (1) data augmentation, (2) weakening language information, and (3) enhancing image information, with the aim of achieving higher accuracy and making VQA systems more robust.

Generative Bias for Visual Question Answering

A generative method to train the bias model directly from the target model, called GenB, which employs a generative network to learn the bias through a combination of the adversarial objective and knowledge distillation.

Appendix for “Introspective Distillation for Robust Question Answering” A Causal QA Model

contains ∼ 4.4M answers in the training set, and ∼ 41K images, ∼ 214K questions, and ∼ 2.1M answers in the validation set. VQA-CP contains ∼ 121K images, ∼ 438K questions, and ∼ 4.4M answers in the

X-GGM: Graph Generative Modeling for Out-of-distribution Generalization in Visual Question Answering

This paper formulates OOD generalization in VQA as a compositional generalization problem and proposes a graph generative modeling-based training scheme (X-GGM) to handle the problem implicitly, while aiming to alleviate the unstable training issue in graph generative modeling.

A Picture May Be Worth a Hundred Words for Visual Question Answering

This paper proposes to take description-question pairs as input and feed them into a language-only Transformer model, simplifying the process and reducing the computational cost, and experiments with data augmentation techniques to increase the diversity of the training set and avoid learning statistical bias.

Check It Again: Progressive Visual Question Answering via Visual Entailment

A select-and-rerank (SAR) progressive framework based on Visual Entailment is proposed, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55% improvement.

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

This work proposes Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models, and enables universal performance lift for pre-trained models over diverse tasks designed to evaluate broad aspects of robustness.

Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA

A new dataset is proposed that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets and systematically studies these shortcuts, which may promote the exploration of shortcut learning in VQA.

Rethinking Data Augmentation for Robust Visual Question Answering

A model-agnostic DA strategy that can be seamlessly incorporated into any VQA architecture, and a knowledge distillation based answer assignment to generate pseudo answers for all composed image-question pairs, which are robust to both in-domain and out-of-distribution settings.



Cycle-Consistency for Robust Visual Question Answering

A model-agnostic framework is proposed that trains a model to not only answer a question, but also generate a question conditioned on the answer, such that the answer predicted for the generated question is the same as the ground truth answer to the original question.

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers.

Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases

This paper trains a naive model that makes predictions exclusively based on dataset biases, and a robust model as part of an ensemble with the naive one in order to encourage it to focus on other patterns in the data that are more likely to generalize.
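The ensemble idea in this summary can be sketched in a few lines. The block below is a minimal illustration, assuming a product-of-experts style combination in which the log-probabilities of a frozen bias-only model are added to those of the robust model before the loss is computed. The summary does not specify the combination rule, so that choice, along with the function names `ensemble_log_probs` and `ensemble_loss`, is an assumption made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def ensemble_log_probs(robust_logits, bias_log_probs):
    """Combine the robust model with a frozen bias-only model.

    Assumption: log-probabilities are summed and renormalized
    (a product of the two distributions).
    """
    robust_lp = log_softmax(robust_logits)
    combined = [r + b for r, b in zip(robust_lp, bias_log_probs)]
    return log_softmax(combined)


def ensemble_loss(robust_logits, bias_log_probs, label):
    """Cross-entropy of the ensemble; only the robust model would be updated."""
    return -ensemble_log_probs(robust_logits, bias_log_probs)[label]


# An example the bias-only model already answers confidently and correctly.
undecided = [0.0, 0.0]                       # robust model has no preference yet
biased = [math.log(0.9), math.log(0.1)]      # bias-only model favors answer 0
loss_alone = -log_softmax(undecided)[0]
loss_ensemble = ensemble_loss(undecided, biased, label=0)
print(loss_ensemble < loss_alone)
```

Under this assumption, examples that the bias-only model already gets right contribute a smaller loss, so the robust model is pushed less to fit them and more to fit examples where the bias fails. At test time only the robust model's own logits would be used.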

Counterfactual Samples Synthesizing for Robust Visual Question Answering

A model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme that significantly improves both visual-explainable and question-sensitive abilities of VQA models and, in return, the performance of these models is further boosted.

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

This work introduces a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed, and poses training as an adversarial game between the VQA model and this question-only adversary, discouraging the VQA model from capturing language bias in its question encoding.

RUBi: Reducing Unimodal Biases in Visual Question Answering

RUBi, a new learning strategy to reduce biases in any VQA model, is proposed, which reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.
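One way to picture how the importance of biased examples could be reduced is sketched below. It assumes that a question-only branch produces per-answer logits, that these are squashed by a sigmoid into a mask, and that the mask multiplies the VQA model's logits before the loss is computed. The summary above does not state this mechanism, so the masking rule and the names `masked_probs` and `cross_entropy` are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_probs(vqa_logits, question_only_logits):
    """Scale each VQA logit by a sigmoid mask from the question-only branch.

    Assumption: elementwise masking of logits, used during training only.
    """
    masked = [v * sigmoid(q) for v, q in zip(vqa_logits, question_only_logits)]
    return softmax(masked)


def cross_entropy(probs, label):
    return -math.log(probs[label])


vqa_logits = [2.0, 1.0]
question_only_logits = [4.0, -4.0]   # the question alone strongly suggests answer 0

plain = softmax(vqa_logits)
masked = masked_probs(vqa_logits, question_only_logits)

# Biased example (label 0): the masked loss is smaller than the plain loss.
print(cross_entropy(masked, 0) < cross_entropy(plain, 0))
# Counter-bias example (label 1): the masked loss is larger than the plain loss.
print(cross_entropy(masked, 1) > cross_entropy(plain, 1))
```

In this sketch, an example that can be answered from the question alone yields a smaller training loss, while an example whose answer contradicts the question-only prediction yields a larger one, which matches the reweighting behavior the summary describes.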

Learning Answer Embeddings for Visual Question Answering

The proposed probabilistic model for visual question answering takes the semantic relationships among answers into consideration, instead of viewing them as independent ordinal numbers, and performs well not only on in-domain learning but also on transfer learning.

Self-Critical Reasoning for Robust Visual Question Answering

This work introduces a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates.

VQA-LOL: Visual Question Answering under the Lens of Logic

This paper proposes a model that uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation.

Unshuffling Data for Improved Generalization

This work describes a training procedure to capture the patterns that are stable across environments while discarding spurious ones, and demonstrates multiple use cases with the task of visual question answering, which is notorious for dataset biases.