Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

@article{goyal_vqa_matter,
  title={Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering},
  author={Yash Goyal and Tejas Khot and Aishwarya Agrawal and Douglas Summers-Stay and Dhruv Batra and Devi Parikh},
  journal={International Journal of Computer Vision},
  pages={398--414}
}
The problem of visual question answering (VQA) is of significant importance both as a challenging research question and for the rich set of applications it enables. [...] This can help in building trust for machines among their users.

Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder

This work proposes a novel model-agnostic question encoder, Visually-Grounded Question Encoder (VGQE), for VQA that reduces the dependency of the model on the language priors, and achieves state-of-the-art results on the bias-sensitive split of the VQAv2 dataset.
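The core idea of grounding the question encoding in the image can be illustrated with a deliberately simplified sketch: each word embedding is fused with a visual feature before the recurrent update, so the question representation depends on the image, not on language statistics alone. This is a hypothetical toy (a plain tanh RNN with concatenation); the actual VGQE is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)

def vgqe_encode(word_embs, visual_feat, Wx, Wh):
    """Sketch of a visually-grounded question encoder: every word embedding
    is concatenated with a visual feature before the recurrent update, so
    the final question representation is conditioned on the image rather
    than on language priors alone. (Simplified tanh RNN, not the real VGQE.)"""
    h = np.zeros(Wh.shape[0])
    for w in word_embs:
        x = np.concatenate([w, visual_feat])  # fuse word and visual features
        h = np.tanh(Wx @ x + Wh @ h)          # recurrent update on fused input
    return h

d_word, d_vis, d_hid = 4, 3, 5
Wx = rng.normal(size=(d_hid, d_word + d_vis))
Wh = rng.normal(size=(d_hid, d_hid))
words = [rng.normal(size=d_word) for _ in range(6)]  # one toy question
img_a, img_b = rng.normal(size=d_vis), rng.normal(size=d_vis)

# The same question paired with different images yields different encodings,
# which is exactly what a purely language-driven encoder cannot do.
h_a = vgqe_encode(words, img_a, Wx, Wh)
h_b = vgqe_encode(words, img_b, Wx, Wh)
```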

RUBi: Reducing Unimodal Biases in Visual Question Answering

RUBi, a new learning strategy to reduce biases in any VQA model, is proposed, which reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.
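The masking mechanism behind this strategy can be sketched as follows: a question-only branch predicts answers from the question alone, and its sigmoid output modulates the main model's logits so that examples answerable without the image dominate (and can then be down-weighted) during training. Names and shapes here are illustrative, not RUBi's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rubi_fuse(base_logits, question_only_logits):
    """RUBi-style fusion sketch: the question-only branch produces a mask
    that modulates the main model's logits. Answers that are predictable
    from the question alone keep high fused logits, exposing the bias so
    the training loss can reduce its influence; at test time only the
    base logits are used."""
    mask = sigmoid(question_only_logits)
    return base_logits * mask

# Toy example with 3 candidate answers.
base = np.array([2.0, 1.0, 0.5])     # main VQA model logits
q_only = np.array([5.0, -5.0, 0.0])  # question-only branch logits

fused = rubi_fuse(base, q_only)
# answer 0 is strongly predicted from the question alone, so its logit
# survives the mask; answer 1 is suppressed almost to zero
```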

Visual Question Generation as Dual Task of Visual Question Answering

This paper proposes an end-to-end unified model, the Invertible Question Answering Network (iQAN), which introduces question generation as a dual task of question answering to improve VQA performance, and shows that the proposed dual training framework consistently improves the performance of many popular VQA architectures.

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.

Overcoming language priors in VQA via adding visual module

This work proposes a method that strengthens the influence of visual content on answers in VQA by adding a visual module; experiments demonstrate the method's effectiveness and show accuracy improvements across different models.

Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool

This paper proposes a variational iVQA model that can generate diverse, grammatically correct, and content-correlated questions matching a given answer, and shows that iVQA is an interesting benchmark for visuo-linguistic understanding and a more challenging alternative to VQA, because an iVQA model needs to understand the image better to succeed.


Estimating semantic structure for the VQA answer space

This work proposes two measures of proximity between VQA answer classes and a corresponding loss that takes the estimated proximity into account, and shows that the approach is completely model-agnostic, yielding consistent improvements with three different VQA models.
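One way such a proximity-aware loss can work is to replace the one-hot target with a soft target that spreads probability mass from the ground-truth class to semantically close classes, so near-miss answers are penalised less than distant ones. The proximity matrix and class names below are invented for illustration; the paper's actual proximity measures and loss may differ.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def proximity_loss(logits, target, proximity):
    """Cross-entropy against a soft target built from class proximities:
    the ground-truth row of the proximity matrix, normalised, replaces
    the one-hot label, so confusing 'red' with 'crimson' costs less
    than confusing it with 'bicycle'."""
    soft = proximity[target] / proximity[target].sum()
    log_p = np.log(softmax(logits) + 1e-12)
    return -np.dot(soft, log_p)

# 3 toy answer classes: 0 ("red") is close to 1 ("crimson"), far from 2 ("bicycle").
prox = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.1],
                 [0.1, 0.1, 1.0]])

near_miss = proximity_loss(np.array([0.0, 3.0, 0.0]), 0, prox)  # predicts class 1
far_miss  = proximity_loss(np.array([0.0, 0.0, 3.0]), 0, prox)  # predicts class 2
# the semantically closer mistake incurs the smaller loss
```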

An experimental study of the vision-bottleneck in VQA

This work presents an in-depth study of the vision bottleneck in VQA, experimenting with both the quantity and the quality of visual objects extracted from images, and studies the impact of two methods for incorporating the object information needed to answer a question: directly in the reasoning module, and earlier, at the object-selection stage.

VC-VQA: Visual Calibration Mechanism For Visual Question Answering

The proposed model reconstructs image features from the predicted answer together with the question, and measures the similarity between the reconstructed image feature and the original image feature, which guides the VQA model in predicting the final answer.
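The calibration step described above can be sketched as a re-scoring rule: each candidate answer comes with an image feature reconstructed from the (question, answer) pair, and answers whose reconstruction is close to the original image feature are boosted. The function names, cosine similarity choice, and weighting are assumptions for illustration, not the paper's exact mechanism.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rescore_answers(answer_scores, reconstructions, image_feat, alpha=0.5):
    """Visual-calibration sketch: boost each answer's score by the
    similarity between the image feature reconstructed from that answer
    (plus the question) and the original image feature, so answers that
    are consistent with the image win over purely linguistic guesses."""
    sims = np.array([cosine(r, image_feat) for r in reconstructions])
    return np.asarray(answer_scores) + alpha * sims

image_feat = np.array([1.0, 0.0])
recons = [np.array([1.0, 0.0]),   # answer 0 reconstructs the image well
          np.array([0.0, 1.0])]   # answer 1 does not
scores = np.array([0.0, 0.2])     # language prior slightly favours answer 1

calibrated = rescore_answers(scores, recons, image_feat)
# calibration flips the ranking toward the visually consistent answer 0
```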