Answer Them All! Toward Universal Visual Question Answering Models

@article{Shrestha2019AnswerTA,
  title={Answer Them All! Toward Universal Visual Question Answering Models},
  author={Robik Shrestha and Kushal Kafle and Christopher Kanan},
  journal={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019},
  pages={10464-10473}
}
Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g… 
Comparative Study of Visual Question Answering Algorithms
Visual Question Answering (VQA) is a recent task that challenges algorithms to reason about the visual content of an image to be able to answer a natural language question. In this study, we compare
Visual question answering: a state-of-the-art review
TLDR
This review extensively and critically examines the current status of VQA research in terms of step-by-step solution methodologies, datasets, and evaluation metrics, and discusses future research directions for each of these aspects of VQA separately.
WeaQA: Weak Supervision via Captions for Visual Question Answering
TLDR
This work presents a method to train models with synthetic Q-A pairs generated procedurally from captions, and demonstrates the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models.
A survey of methods, datasets and evaluation metrics for visual question answering
TLDR
This paper discusses some of the core concepts used in VQA systems, presents a comprehensive survey of past efforts to address the problem, and covers new datasets developed in 2019 and 2020.
RUBi: Reducing Unimodal Biases in Visual Question Answering
TLDR
RUBi, a new learning strategy for reducing biases in any VQA model, is proposed; it reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.
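A rough sketch of the RUBi idea (the function and variable names below are hypothetical, not taken from the paper): a question-only branch predicts answers from the question alone, and its sigmoid output masks the main model's logits during training, so questions that can be answered without the image contribute weaker gradients.

import torch
import torch.nn.functional as F

def rubi_training_losses(vqa_logits, question_only_logits, answers):
    """Illustrative RUBi-style training objective (hypothetical names).

    vqa_logits:           [B, A] scores from the full VQA model
    question_only_logits: [B, A] scores from a question-only branch
    answers:              [B]    ground-truth answer indices
    """
    # The question-only branch acts as a bias detector: its sigmoid output
    # scales the main model's logits, so examples it already classifies
    # correctly (the most biased ones) yield smaller gradients for the model.
    mask = torch.sigmoid(question_only_logits)
    fused_logits = vqa_logits * mask

    main_loss = F.cross_entropy(fused_logits, answers)            # trains the VQA model
    branch_loss = F.cross_entropy(question_only_logits, answers)  # trains the bias branch
    return main_loss + branch_loss

At test time the question-only branch and the mask are dropped and only the main model's predictions are used.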
In Defense of Grid Features for Visual Question Answering
TLDR
This paper revisits grid features for VQA, and finds they can work surprisingly well -- running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion).
Answering Questions about Data Visualizations using Efficient Bimodal Fusion
TLDR
This work proposes a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL), which first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question.
A negative case analysis of visual grounding methods for VQA
TLDR
It is found that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements, and a simpler regularization scheme is proposed that achieves near state-of-the-art performance on VQA-CPv2.
CQ-VQA: Visual Question Answering on Categorized Questions
TLDR
A novel two-level hierarchical but end-to-end model for visual question answering (VQA) is proposed, evaluated on the TDIUC dataset, and benchmarked against state-of-the-art approaches.
A Picture May Be Worth a Hundred Words for Visual Question Answering
TLDR
This paper proposes to take description-question pairs as input and feed them into a language-only Transformer model, simplifying the pipeline and reducing the computational cost, and experiments with data augmentation techniques to increase diversity in the training set and avoid learning statistical biases.

References

Showing 1-10 of 60 references
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
An Analysis of Visual Question Answering Algorithms
TLDR
This paper analyzes existing VQA algorithms using a new dataset called the Task Driven Image Understanding Challenge (TDIUC), which has over 1.6 million questions organized into 12 different categories, and proposes new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms.
Visual question answering: Datasets, algorithms, and future challenges
TLDR
This review critically examines the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms, and exhaustively reviews existing algorithms for VQA.
Visual question answering: A survey of methods and datasets
TLDR
The state of the art is examined by comparing modern approaches to VQA, including the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space.
Answer-Type Prediction for Visual Question Answering
TLDR
This paper builds a system for answering open-ended text-based questions about images, a task known as Visual Question Answering (VQA), which predicts the form of the answer from the question within a Bayesian framework.
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
TLDR
GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers.
Learning to Count Objects in Natural Images for Visual Question Answering
TLDR
A neural network component is proposed that allows robust counting from object proposals and obtains state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with a single model.
C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset
TLDR
This paper proposes a new setting for Visual Question Answering where the test question-answer pairs are compositionally novel compared to training question-answer pairs, and presents a new compositional split of the VQA v1.0 dataset, called Compositional VQA (C-VQA).
Visual Question Answering as a Meta Learning Task
TLDR
This work adapts a state-of-the-art VQA model with two techniques from the recent meta learning literature, namely prototypical networks and meta networks, and produces qualitatively distinct results with higher recall of rare answers and better sample efficiency that allows training with little initial data.
Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering
TLDR
This work presents a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words, and shows through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size.
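A minimal sketch of the bidirectional attention pattern described above, in which each question word attends over image regions and each image region attends over question words (the function and tensor names are illustrative assumptions, not the authors' code):

import torch

def symmetric_co_attention(word_feats, region_feats):
    """Illustrative word-region co-attention.

    word_feats:   [B, T, D] question-word features
    region_feats: [B, R, D] image-region features
    """
    # Affinity between every question word and every image region.
    affinity = torch.bmm(word_feats, region_feats.transpose(1, 2))  # [B, T, R]

    # Each word attends over regions (softmax along the region axis).
    attended_regions = torch.bmm(torch.softmax(affinity, dim=2), region_feats)  # [B, T, D]

    # Each region attends over words (softmax along the word axis).
    attended_words = torch.bmm(
        torch.softmax(affinity, dim=1).transpose(1, 2), word_feats)             # [B, R, D]

    return attended_regions, attended_words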