VQA: Visual Question Answering

@article{Agrawal2015VQAVQ,
  title={VQA: Visual Question Answering},
  author={Aishwarya Agrawal and Jiasen Lu and Stanislaw Antol and Margaret Mitchell and C. Lawrence Zitnick and Devi Parikh and Dhruv Batra},
  journal={International Journal of Computer Vision},
  year={2015},
  volume={123},
  pages={4--31}
}
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs… 
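Concretely, open-ended answers in this setting are scored against multiple human answers rather than a single ground truth. The snippet below is a minimal sketch of the consensus-style accuracy associated with the VQA dataset (full credit when at least three annotators gave the same answer); the function name and example strings are illustrative, and the official evaluation additionally averages over annotator subsets and normalizes answers, which is omitted here.

# Hedged sketch: consensus accuracy for one open-ended answer,
# assuming min(#matching human answers / 3, 1) as the per-question score.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    normalized = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)

# Example: ten crowdsourced answers collected for a single question.
humans = ["bananas"] * 8 + ["fruit", "banana"]
print(vqa_accuracy("bananas", humans))  # -> 1.0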
Visual Question Answering Using Semantic Information from Image Descriptions
TLDR
A deep neural network model is proposed that uses an attention mechanism combining image features, the natural language question asked, and semantic knowledge extracted from the image to produce open-ended answers to the given questions.
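As a rough illustration of what such an attention mechanism can look like, the PyTorch sketch below attends over spatial image features guided jointly by the question encoding and a semantic encoding; all dimensions, the module name, and the fixed-vocabulary classification head are assumptions for illustration (the paper itself produces open-ended answers).

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionSemanticAttention(nn.Module):
    # Illustrative attention over K image regions, guided jointly by the
    # question encoding and a semantic encoding extracted from the image.
    def __init__(self, img_dim=2048, q_dim=512, sem_dim=512, hid=512, num_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid)
        self.ctx_proj = nn.Linear(q_dim + sem_dim, hid)
        self.att = nn.Linear(hid, 1)
        self.classifier = nn.Linear(hid + q_dim + sem_dim, num_answers)

    def forward(self, img_feats, q_enc, sem_enc):
        # img_feats: (B, K, img_dim); q_enc: (B, q_dim); sem_enc: (B, sem_dim)
        ctx = torch.cat([q_enc, sem_enc], dim=-1)
        proj = self.img_proj(img_feats)                           # (B, K, hid)
        joint = torch.tanh(proj + self.ctx_proj(ctx).unsqueeze(1))
        weights = F.softmax(self.att(joint).squeeze(-1), dim=1)   # (B, K)
        attended = (weights.unsqueeze(-1) * proj).sum(dim=1)      # (B, hid)
        return self.classifier(torch.cat([attended, ctx], dim=-1))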
Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions
TLDR
These approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, are able to outperform strong baselines on both relevance tasks and are shown to be more intelligent, reasonable, and human-like than previous approaches.
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
TLDR
This work introduces A-OKVQA, a crowdsourced dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer, and demonstrates the potential of this new dataset through a detailed analysis of its contents and baseline performance measurements over a variety of state-of-the-art vision–language models.
Question Relevance in Visual Question Answering
TLDR
A large dataset is generated from existing visual question answering datasets in order to enable the training of complex architectures and model the relevance of a visual question to an image.
Visual Question Answering With Enhanced Question-Answer Diversity
TLDR
A method of enhancing the question-answer (QA) input data is proposed, using the Visual Genome dataset's extensive image annotations to automatically generate new QA pairs; the results suggest that this data augmentation improves a VQA model's robustness to unseen data.
Can Open Domain Question Answering Systems Answer Visual Knowledge Questions?
TLDR
This work proposes a potentially data-efficient approach that reuses existing systems for image analysis, question rewriting, and text-based question answering to answer many visual questions, and explores two rewriting strategies that combine adaptive rewriting and reinforcement learning techniques to exploit implicit feedback from the QA system.
Proposing Plausible Answers for Open-ended Visual Question Answering
TLDR
This work provides both intrinsic and extrinsic evaluations for the task of Answer Proposal, showing that the best model learns to propose plausible answers with a high recall and performs competitively with some other solutions to VQA.
Revisiting Visual Question Answering Baselines
TLDR
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
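The binary-classification alternative can be read as scoring whole (image, question, candidate answer) triples for correctness; below is a minimal sketch of that idea, with the feature dimensions and MLP layout assumed rather than taken from the paper.

import torch
import torch.nn as nn

class TripleScorer(nn.Module):
    # Scores an (image, question, answer) triple as correct or incorrect,
    # in the spirit of treating multiple-choice VQA as binary classification.
    def __init__(self, img_dim=2048, txt_dim=300, hid=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + 2 * txt_dim, hid),
            nn.ReLU(),
            nn.Linear(hid, 1),
        )

    def forward(self, img_feat, q_emb, a_emb):
        # img_feat: (B, img_dim); q_emb, a_emb: (B, txt_dim)
        return self.mlp(torch.cat([img_feat, q_emb, a_emb], dim=-1)).squeeze(-1)

# At test time, the candidate answer with the highest score would be selected.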
FVQA: Fact-Based Visual Question Answering
TLDR
A conventional visual question answering dataset containing image-question-answer triplets is extended with additional image-question-answer-supporting-fact tuples, and a novel model is described that is capable of reasoning about an image on the basis of supporting facts.
...

References

Showing 1-10 of 76 references
Yin and Yang: Balancing and Answering Binary Visual Questions
TLDR
This paper addresses binary Visual Question Answering on abstract scenes as visual verification of the concepts inquired about in the questions, by converting each question into a tuple that concisely summarizes the visual concept to be detected in the image.
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose…
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
TLDR
The mQA model, which is able to answer questions about the content of an image, is presented; it contains four components: a Long Short-Term Memory (LSTM) network that encodes the question, a Convolutional Neural Network (CNN) that extracts the visual representation, an LSTM that stores the linguistic context of the answer, and a fusing component that combines the information from the first three components to generate the answer.
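A compact sketch of a four-component layout of this kind follows, with vocabulary size, embedding widths, and wiring chosen for illustration rather than reproducing the published mQA configuration (image features are assumed to come from a pretrained CNN).

import torch
import torch.nn as nn

class MQASketch(nn.Module):
    # Four parts: a question LSTM, a projection of CNN image features,
    # an answer LSTM, and a fusing layer that predicts the next answer word.
    def __init__(self, vocab=10000, emb=512, img_dim=2048, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.q_lstm = nn.LSTM(emb, hid, batch_first=True)
        self.a_lstm = nn.LSTM(emb, hid, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hid)
        self.fuse = nn.Linear(3 * hid, vocab)

    def forward(self, q_tokens, img_feat, a_tokens):
        # q_tokens, a_tokens: (B, T) token ids; img_feat: (B, img_dim)
        _, (q_h, _) = self.q_lstm(self.embed(q_tokens))              # q_h: (1, B, hid)
        a_out, _ = self.a_lstm(self.embed(a_tokens))                 # (B, T, hid)
        ctx = torch.cat([q_h[-1], self.img_proj(img_feat)], dim=-1)  # (B, 2*hid)
        ctx = ctx.unsqueeze(1).expand(-1, a_out.size(1), -1)         # (B, T, 2*hid)
        return self.fuse(torch.cat([a_out, ctx], dim=-1))            # (B, T, vocab)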
MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
TLDR
MCTest is presented, a freely available set of stories and associated questions intended for research on the machine comprehension of text that requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.
Exploring Models and Data for Image Question Answering
TLDR
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
TLDR
This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classify these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
TLDR
This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
From captions to visual concepts and back
TLDR
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
Open question answering over curated and extracted knowledge bases
TLDR
This paper presents OQA, the first approach to leverage both curated and extracted KBs, and demonstrates that it achieves up to twice the precision and recall of a state-of-the-art Open QA system.
Visual Madlibs: Fill in the Blank Description Generation and Question Answering
TLDR
A new dataset consisting of 360,001 focused natural language descriptions for 10,738 images is introduced, and its applicability to two new tasks is demonstrated: focused description generation and multiple-choice question answering for images.
...