Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. International Journal of Computer Vision.
The problem of visual question answering (VQA) is of significant importance both as a challenging research question and for the rich set of applications it enables. Forcing models to rely on the image rather than on language priors can also help build users' trust in machines.

RUBi: Reducing Unimodal Biases in Visual Question Answering

RUBi, a new learning strategy to reduce biases in any VQA model, is proposed, which reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.
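RUBi's masking idea can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; the function names and toy logits are mine. A question-only branch scores the answers, and a sigmoid of those scores rescales the base model's logits, so examples answerable from the question alone contribute a smaller gradient.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rubi_fused_logits(vqa_logits, question_only_logits):
    # Mask the base model's logits with a sigmoid of the question-only
    # branch's logits: answers the question-only branch is confident
    # about dominate the fused prediction, so on such "biased" examples
    # the base model receives a smaller training signal.
    mask = 1.0 / (1.0 + np.exp(-question_only_logits))
    return vqa_logits * mask

def cross_entropy(logits, label):
    return -np.log(softmax(logits)[label])

# Toy example: the question-only branch is confident in answer 0.
vqa_logits = np.array([2.0, 0.5, -1.0])
q_logits = np.array([5.0, -5.0, -5.0])
fused = rubi_fused_logits(vqa_logits, q_logits)
loss = cross_entropy(fused, 0)
```

At test time the mask is dropped and only the base model's logits are used, so the sketch above only affects training.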

Visual Question Generation as Dual Task of Visual Question Answering

This paper proposes an end-to-end unified model, the Invertible Question Answering Network (iQAN), which introduces question generation as a dual task of question answering to improve VQA performance, and shows that the proposed dual training framework consistently improves many popular VQA architectures.

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.

Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool

This paper proposes a variational iVQA model that can generate diverse, grammatically correct, and content-correlated questions that match a given answer, and shows that iVQA is an interesting benchmark for visuo-linguistic understanding and a more challenging alternative to VQA, because an iVQA model needs to understand the image better to be successful.


Estimating semantic structure for the VQA answer space

This work proposes two measures of proximity between VQA classes and a corresponding loss that takes the estimated proximity into account, and shows that the approach is completely model-agnostic, yielding consistent improvements with three different VQA models.
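A proximity-aware loss of this kind can be illustrated with a small sketch (assumptions: NumPy, a precomputed class-similarity matrix `S`, and my own function names; the paper derives its proximity estimates from data):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def proximity_loss(logits, target, S):
    # S[i, j] in [0, 1] is the estimated semantic proximity between
    # answer classes i and j, with S[i, i] = 1. Instead of a one-hot
    # target, probability mass is spread over classes close to the
    # ground truth, so near-miss answers are penalised less.
    soft_target = S[target] / S[target].sum()
    return -(soft_target * log_softmax(logits)).sum()

# With an identity similarity matrix this reduces to ordinary
# cross-entropy; with off-diagonal similarity, a semantically close
# wrong answer lowers the loss.
logits = np.array([1.0, 2.0, 0.0])
ce = proximity_loss(logits, 0, np.eye(3))
S = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
soft = proximity_loss(logits, 0, S)
```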

VC-VQA: Visual Calibration Mechanism For Visual Question Answering

The proposed model reconstructs image features from the predicted answer and the question, and measures the similarity between the reconstructed and original image features, which guides the VQA model in predicting the final answer.

Answer Them All! Toward Universal Visual Question Answering Models

A new VQA algorithm is proposed that rivals or exceeds the state of the art for both natural and synthetic image domains while using the same visual features, answer vocabularies, and other components.

SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering

This work studies the robustness of VQA models from a novel perspective, visual context, and proposes a simple yet effective perturbation technique, SwapMix, which can also be applied as a data-augmentation strategy during training to regularize over-reliance on context.
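The perturbation itself is simple to sketch (a NumPy toy with my own function and variable names; the real method operates on detector region features and matches swaps by object class):

```python
import numpy as np

rng = np.random.default_rng(0)

def swapmix(obj_feats, relevant, bank):
    """Swap the features of context objects (those the question does not
    refer to) with object features drawn from other images.

    obj_feats: (n_objects, d) region features of one image
    relevant:  boolean (n_objects,), True = object mentioned by the question
    bank:      (m, d) object features pooled from other images
    """
    out = obj_feats.copy()
    context = np.where(~relevant)[0]
    out[context] = bank[rng.integers(0, len(bank), size=len(context))]
    return out  # a robust model's answer should not change

# Objects 0 and 3 are question-relevant; 1 and 2 are swappable context.
feats = np.arange(12.0).reshape(4, 3)
relevant = np.array([True, False, False, True])
bank = np.full((5, 3), -1.0)
perturbed = swapmix(feats, relevant, bank)
```

Measuring how often the predicted answer flips under such swaps gives the diagnostic; applying the swaps during training gives the regularizer.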

Debiased Visual Question Answering from Feature and Sample Perspectives

A method named D-VQA is proposed to alleviate unimodal biases from both the feature and the sample perspective; it applies two unimodal bias-detection modules to explicitly recognise and remove negative biases in the language and vision modalities.



Hierarchical Question-Image Co-Attention for Visual Question Answering

This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).

C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset

This paper proposes a new setting for Visual Question Answering in which the test question-answer pairs are compositionally novel relative to the training question-answer pairs, and presents a new compositional split of the VQA v1.0 dataset, called Compositional VQA (C-VQA).

An Analysis of Visual Question Answering Algorithms

This paper analyzes existing VQA algorithms using a new dataset called the Task Driven Image Understanding Challenge (TDIUC), which has over 1.6 million questions organized into 12 different categories, and proposes new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms.

Yin and Yang: Balancing and Answering Binary Visual Questions

This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired in the questions by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers.

Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions

These approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, outperform strong baselines on both relevance tasks and are shown to produce responses that are perceived as more intelligent, reasonable, and human-like than those of previous approaches.

Visual question answering: Datasets, algorithms, and future challenges

Visual7W: Grounded Question Answering in Images

A semantic link between textual descriptions and image regions by object-level grounding enables a new type of QA with visual answers, in addition to textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

DualNet: Domain-invariant network for visual question answering

A method called DualNet is proposed that demonstrates performance invariant to the differences between real and abstract scene domains; experimental results show that DualNet outperforms state-of-the-art methods, especially in the abstract-image category.