Answer Them All! Toward Universal Visual Question Answering Models

@article{Shrestha2019AnswerTA,
  title={Answer Them All! Toward Universal Visual Question Answering Models},
  author={Robik Shrestha and Kushal Kafle and Christopher Kanan},
  journal={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019},
  pages={10464-10473}
}
Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g… 

Citations

Comparative Study of Visual Question Answering Algorithms
TLDR
This study compares the performance of state-of-the-art VQA algorithms on different VQA benchmarks and introduces external-knowledge-based algorithms, which need external sources to retrieve facts necessary to answer a question when those facts may be present neither in the scene nor in the whole training data set.
Visual question answering: a state-of-the-art review
TLDR
This review extensively and critically examines the current status of VQA research in terms of step-by-step solution methodologies, datasets, and evaluation metrics, and discusses future research directions for all of the above-mentioned aspects of VQA separately.
RUBi: Reducing Unimodal Biases in Visual Question Answering
TLDR
RUBi, a new learning strategy that reduces biases in any VQA model, is proposed; it lowers the importance of the most biased examples, i.e., examples that can be correctly classified without looking at the image.
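The summary above only names the idea, so here is a minimal, hypothetical PyTorch sketch of one way to downweight examples that a question-only branch can already classify, in the spirit of RUBi; the module names, dimensions, masking choice, and detaching are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): downweight question-biased examples
# by masking the main logits with a question-only branch's confidence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RUBiStyleHead(nn.Module):
    def __init__(self, fused_dim, q_dim, num_answers):
        super().__init__()
        self.main_clf = nn.Linear(fused_dim, num_answers)  # multimodal classifier
        self.q_only_clf = nn.Linear(q_dim, num_answers)    # question-only classifier

    def forward(self, fused, q_emb, labels):
        logits = self.main_clf(fused)
        # Detach so the question-only loss does not reshape the shared question
        # encoder (one possible design choice, stated here as an assumption).
        q_logits = self.q_only_clf(q_emb.detach())
        # Examples the question-only branch already answers confidently get
        # smaller effective logits, hence less weight in the main loss.
        masked_logits = logits * torch.sigmoid(q_logits)
        loss = F.cross_entropy(masked_logits, labels) + F.cross_entropy(q_logits, labels)
        return masked_logits, loss
```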
In Defense of Grid Features for Visual Question Answering
TLDR
This paper revisits grid features for VQA and finds that they can work surprisingly well, running more than an order of magnitude faster than region features while reaching the same accuracy (e.g., if pre-trained in a similar fashion).
Answering Questions about Data Visualizations using Efficient Bimodal Fusion
TLDR
This work proposes a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL), which first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question.
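As a rough illustration of the "fuse then aggregate" pattern described above, here is a hypothetical PyTorch sketch that concatenates the question embedding with every image-grid cell, fuses each pair with a small MLP, and sums the fused embeddings before classifying; the layer sizes and the simple sum aggregation are assumptions and do not reproduce the PReFIL architecture.

```python
# Hypothetical "fuse then aggregate" bimodal pipeline (illustrative only).
import torch
import torch.nn as nn

class FuseThenAggregate(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hidden=512, num_answers=3000):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(img_dim + q_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.classify = nn.Linear(hidden, num_answers)

    def forward(self, img_feats, q_emb):
        # img_feats: (B, N, img_dim) grid cells; q_emb: (B, q_dim)
        q_tiled = q_emb.unsqueeze(1).expand(-1, img_feats.size(1), -1)
        fused = self.fuse(torch.cat([img_feats, q_tiled], dim=-1))  # (B, N, hidden)
        pooled = fused.sum(dim=1)                                   # simple aggregation
        return self.classify(pooled)
```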
A negative case analysis of visual grounding methods for VQA
TLDR
It is found that providing proper, human-based cues is not actually necessary: random, insensible cues result in similar improvements. A simpler regularization scheme is also proposed that achieves near state-of-the-art performance on VQA-CP v2.
CQ-VQA: Visual Question Answering on Categorized Questions
TLDR
A novel two-level hierarchical yet end-to-end model for visual question answering (VQA) is proposed, evaluated on the TDIUC dataset, and benchmarked against state-of-the-art approaches.
A Picture May Be Worth a Hundred Words for Visual Question Answering
TLDR
This paper proposes to take description-question pairs as input and feed them into a language-only Transformer model, simplifying the pipeline and reducing the computational cost, and experiments with data augmentation techniques to increase the diversity of the training set and avoid learning statistical bias.
Question-Driven Graph Fusion Network For Visual Question Answering
TLDR
A Question-Driven Graph Fusion Network (QD-GFN) is proposed that first models semantic, spatial, and implicit visual relations in images with three graph attention networks, then uses question information to guide the aggregation of the three graphs, and finally adopts an object filtering mechanism to remove question-irrelevant objects from the image.

References

Showing 1-10 of 60 references
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
An Analysis of Visual Question Answering Algorithms
TLDR
This paper analyzes existing VQA algorithms using a new dataset called the Task Driven Image Understanding Challenge (TDIUC), which has over 1.6 million questions organized into 12 different categories, and proposes new evaluation schemes that compensate for over-represented question types and make it easier to study the strengths and weaknesses of algorithms.
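To make the "compensate for over-represented question types" point concrete, here is a small Python sketch of mean-per-type accuracy metrics of the kind TDIUC advocates (arithmetic and harmonic means over per-category accuracies); the function names and record format are assumptions.

```python
# Sketch of mean-per-type accuracy metrics (arithmetic and harmonic).
from collections import defaultdict
from statistics import harmonic_mean, mean

def per_type_accuracies(records):
    """records: iterable of (question_type, is_correct) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for qtype, ok in records:
        total[qtype] += 1
        correct[qtype] += int(ok)
    return {t: correct[t] / total[t] for t in total}

def mean_per_type(records):
    accs = list(per_type_accuracies(records).values())
    # The harmonic mean punishes models that fail badly on any single question type.
    return {"arithmetic_mpt": mean(accs), "harmonic_mpt": harmonic_mean(accs)}
```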
Answer-Type Prediction for Visual Question Answering
TLDR
This paper builds a system capable of answering open-ended, text-based questions about images, a task known as Visual Question Answering (VQA); the system predicts the form of the answer from the question within a Bayesian framework.
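As a toy illustration of the Bayesian flavor of answer-type prediction mentioned above, the sketch below scores answers by marginalizing over predicted answer types, p(a | q, v) = sum over t of p(a | t, q, v) * p(t | q); the two-type, three-answer numbers are made up for illustration and are not from the paper.

```python
# Toy marginalization over predicted answer types (illustrative numbers).
import numpy as np

def marginalize_over_types(p_answer_given_type, p_type_given_question):
    # p_answer_given_type: (num_types, num_answers), rows sum to 1
    # p_type_given_question: (num_types,), sums to 1
    return p_type_given_question @ p_answer_given_type  # (num_answers,)

p_a_t = np.array([[0.7, 0.2, 0.1],    # e.g. a "yes/no" answer type
                  [0.1, 0.3, 0.6]])   # e.g. a "counting" answer type
p_t_q = np.array([0.9, 0.1])          # question looks like a yes/no question
print(marginalize_over_types(p_a_t, p_t_q))  # ~[0.64 0.21 0.15]
```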
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
TLDR
GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers.
Learning to Count Objects in Natural Images for Visual Question Answering
TLDR
A neural network component is proposed that allows robust counting from object proposals, obtaining state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with the authors' single model.
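The counting component itself is a differentiable module; as a very rough, non-differentiable stand-in that conveys why counting from object proposals is hard (overlapping duplicate proposals), here is a naive Python sketch that keeps attended proposals and merges near-duplicates by IoU. The thresholds, box format, and function names are assumptions, not the paper's method.

```python
# Naive stand-in for counting from object proposals (NOT the paper's module):
# keep proposals the question attends to, merge near-duplicate boxes by IoU.
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def naive_count(boxes, attention, attn_thresh=0.5, iou_thresh=0.5):
    keep = []
    for box, score in sorted(zip(boxes, attention), key=lambda t: -t[1]):
        if score < attn_thresh:
            continue
        if all(iou(box, k) < iou_thresh for k in keep):
            keep.append(box)
    return len(keep)

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
print(naive_count(boxes, attention=[0.9, 0.8, 0.7]))  # 2 distinct attended objects
```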
C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset
TLDR
This paper proposes a new setting for Visual Question Answering in which the test question-answer pairs are compositionally novel compared to the training question-answer pairs, and presents a new compositional split of the VQA v1.0 dataset, called Compositional VQA (C-VQA).
Visual Question Answering as a Meta Learning Task
TLDR
This work adapts a state-of-the-art VQA model with two techniques from the recent meta-learning literature, namely prototypical networks and meta networks, and produces qualitatively distinct results with higher recall of rare answers and better sample efficiency that allows training with little initial data.
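For readers unfamiliar with the prototypical-network idea mentioned above, here is a minimal NumPy sketch: each answer is represented by the mean embedding of its support examples, and a query is assigned to its nearest prototype. The embeddings and the Euclidean metric are illustrative assumptions about how this might look in a VQA setting.

```python
# Minimal prototypical-classification sketch (illustrative embeddings).
import numpy as np

def build_prototypes(embeddings, labels):
    # embeddings: (N, D); labels: (N,) answer indices
    return {a: embeddings[labels == a].mean(axis=0) for a in np.unique(labels)}

def classify(query, protos):
    # Assign the query to its nearest prototype under squared Euclidean distance.
    dists = {a: np.sum((query - p) ** 2) for a, p in protos.items()}
    return min(dists, key=dists.get)

emb = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = np.array([0, 0, 1, 1])
print(classify(np.array([0.8, 0.9]), build_prototypes(emb, labels)))  # -> 1
```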
Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering
TLDR
This work presents a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words, and shows through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size.
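As a rough sketch of the symmetric co-attention described above, the snippet below builds a word-region affinity matrix and normalizes it along each axis so that words attend over image regions and regions attend over words; it is a bare-bones illustration under assumed tensor shapes, not the paper's full dense co-attention architecture.

```python
# Bare-bones symmetric co-attention between question words and image regions.
import torch
import torch.nn.functional as F

def symmetric_coattention(word_feats, region_feats):
    # word_feats: (B, T, D); region_feats: (B, R, D)
    affinity = torch.bmm(word_feats, region_feats.transpose(1, 2))  # (B, T, R)
    words_to_regions = F.softmax(affinity, dim=2)  # each word attends over regions
    regions_to_words = F.softmax(affinity, dim=1)  # each region attends over words
    attended_regions = torch.bmm(words_to_regions, region_feats)                  # (B, T, D)
    attended_words = torch.bmm(regions_to_words.transpose(1, 2), word_feats)      # (B, R, D)
    return attended_regions, attended_words
```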