Corpus ID: 195584122

RUBi: Reducing Unimodal Biases in Visual Question Answering

@inproceedings{Cadne2019RUBiRU,
  title={RUBi: Reducing Unimodal Biases in Visual Question Answering},
  author={R{\'e}mi Cad{\`e}ne and Corentin Dancette and Hedi Ben-younes and Matthieu Cord and Devi Parikh},
  booktitle={NeurIPS},
  year={2019}
}
Visual Question Answering (VQA) is the task of answering questions about an image. Some VQA models often exploit unimodal biases to provide the correct answer without using the image information. As a result, they suffer from a huge drop in performance when evaluated on data outside their training set distribution. This critical issue makes them unsuitable for real-world settings. We propose RUBi, a new learning strategy to reduce biases in any VQA model. It reduces the importance of the most… 
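
The truncated abstract points at the core idea: a question-only branch estimates how predictable the answer is from the question alone, and this estimate is used during training to down-weight the most biased examples. The snippet below is a minimal, non-authoritative PyTorch sketch of that idea, not the authors' implementation: the base-model interface (base_logits, q_emb), the branch architecture, the helper names (QuestionOnlyBranch, rubi_style_losses), and the detach choice for the question-only loss are assumptions made for illustration; see the paper and official code for the exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionOnlyBranch(nn.Module):
    # Maps a question embedding to answer logits, used both to predict the
    # language bias and, through a sigmoid, to mask the base model's logits.
    def __init__(self, q_dim, n_answers, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(q_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, q_emb):
        return self.mlp(q_emb)

def rubi_style_losses(base_logits, q_emb, q_branch, answers):
    # base_logits: (B, A) answer logits from the full VQA model
    # q_emb:       (B, D) question embedding from the shared question encoder
    # answers:     (B,)   ground-truth answer indices
    mask = torch.sigmoid(q_branch(q_emb))

    # Down-weight answers that the question alone already predicts, so the
    # easiest (most biased) examples contribute smaller gradients to the
    # base model.
    fused_logits = base_logits * mask
    loss_fusion = F.cross_entropy(fused_logits, answers)

    # Train the branch itself to capture the language bias. Detaching q_emb
    # keeps this loss from pushing the shared encoder toward the bias (one
    # plausible gradient-flow choice, assumed here for illustration).
    loss_q_only = F.cross_entropy(q_branch(q_emb.detach()), answers)
    return loss_fusion, loss_q_only

# Training minimizes loss_fusion + loss_q_only; at evaluation time the mask
# and the question-only branch are dropped and only base_logits are used.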

Citations

Debiased Visual Question Answering from Feature and Sample Perspectives
TLDR
A method named D-VQA is proposed to alleviate the above challenges from the feature and sample perspectives, which applies two unimodal bias detection modules to explicitly recognise and remove the negative biases in language and vision modalities.
Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder
TLDR
This work proposes a novel model-agnostic question encoder, Visually-Grounded Question Encoder (VGQE), for VQA that reduces the dependency of the model on the language priors, and achieves state-of-the-art results on the bias-sensitive split of the VQAv2 dataset.
Greedy Gradient Ensemble for Robust Visual Question Answering
TLDR
A new de-biasing framework, Greedy Gradient Ensemble (GGE), is proposed that combines multiple biased models for unbiased base-model learning and forces the biased models to over-fit the biased data distribution first, thus making the base model pay more attention to examples that are hard to solve with the biased models.
Overcoming Language Priors with Self-supervised Learning for Visual Question Answering
TLDR
This paper first automatically generates labeled data to balance the biased data, then proposes a self-supervised auxiliary task that uses the balanced data to help the base VQA model overcome language priors.
Estimating semantic structure for the VQA answer space
TLDR
This work proposes two measures of proximity between VQA answer classes and a corresponding loss that takes the estimated proximity into account, and shows that the approach is completely model-agnostic, yielding consistent improvements with three different VQA models.
Roses are Red, Violets are Blue… But Should VQA expect Them To?
TLDR
The GQA-OOD benchmark is proposed, which measures accuracy over both rare and frequent question-answer pairs, and it is argued that the former is better suited to the evaluation of reasoning abilities.
WeaQA: Weak Supervision via Captions for Visual Question Answering
TLDR
This work presents a method to train models with synthetic Q-A pairs generated procedurally from captions, and demonstrates the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models.
Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering
TLDR
It is demonstrated that even state-of-the-art models perform poorly and that existing techniques to reduce biases are largely ineffective in this context.
Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering
TLDR
It is found that many of the “unknowns” to the learned VQA model are indeed “known” implicitly in the dataset, and a simple data augmentation pipeline, SimpleAug, is presented to turn this “known” knowledge into training examples for VQA.
Introspective Distillation for Robust Question Answering
TLDR
This paper presents a novel debiasing method called Introspective Distillation (IntroD) to make the best of both worlds for QA by introspecting whether a training sample fits in the factual ID world or the counterfactual OOD one.
…

References

Showing 1–10 of 49 references
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in the authors' balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
Revisiting Visual Question Answering Baselines
TLDR
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization
TLDR
This work introduces a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed, and poses training as an adversarial game between the VQA model and this question-only adversary, discouraging the VQA model from capturing language bias in its question encoding.
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
TLDR
GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers.
Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets
TLDR
This paper focuses on the design of multiple-choice datasets where the learner has to select the right answer from a set of candidates comprising the target and several decoys, and proposes automatic procedures to remedy deficiencies in how the decoys are designed.
An Analysis of Visual Question Answering Algorithms
TLDR
This paper analyzes existing VQA algorithms using a new dataset called the Task Driven Image Understanding Challenge (TDIUC), which has over 1.6 million questions organized into 12 different categories, and proposes new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms.
Answer Them All! Toward Universal Visual Question Answering Models
TLDR
A new VQA algorithm is proposed that rivals or exceeds the state of the art on both natural-image and synthetic domains while using the same visual features, answer vocabularies, etc.
Explicit Bias Discovery in Visual Question Answering Models
TLDR
This work stores the words of the question, answer and visual words corresponding to regions of interest in attention maps in a database, and runs simple rule mining algorithms on this database to discover human-interpretable rules which give unique insight into the behavior of VQA models.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer…
VizWiz Grand Challenge: Answering Visual Questions from Blind People
TLDR
Evaluation of modern algorithms for answering visual questions and deciding if a visual question is answerable reveals that VizWiz is a challenging dataset, which is introduced to encourage a larger community to develop more generalized algorithms that can assist blind people.
…