A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, Roozbeh Mottaghi
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a proliferation of VQA datasets, this goal is hindered by a set of common limitations. These include a reliance on relatively simplistic questions that are repetitive in both concepts and linguistic structure, little world knowledge needed outside of the paired image, and limited reasoning required to arrive at… 


Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering
This work identifies a key structural idiom in OK-VQA, viz. …
From Recognition to Cognition: Visual Commonsense Reasoning
To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
FVQA: Fact-Based Visual Question Answering
A conventional visual question answering dataset, containing image-question-answer triplets, is extended with additional image-question-answer-supporting-fact tuples, and a novel model is described that can reason about an image on the basis of supporting facts.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
This work balances the popular VQA dataset by collecting complementary images such that every question in the balanced dataset is associated not with a single image but with a pair of similar images that yield two different answers to the question.
Microsoft COCO Captions: Data Collection and Evaluation Server
The Microsoft COCO Caption dataset and evaluation server are described, and several popular metrics, including BLEU, METEOR, ROUGE, and CIDEr, are used to score candidate captions.
Explicit Knowledge-Based Reasoning for Visual Question Answering
  • In IJCAI, 2017
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and demonstrates the generalizability of the pre-trained cross-modality model.
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
This work studies open-domain knowledge, the setting in which the knowledge required to answer a question is not given or annotated at either training or test time, and significantly outperforms the state of the art on OK-VQA, the largest available dataset for open-domain knowledge-based VQA.
Webly Supervised Concept Expansion for General Purpose Vision Models
This work uses a dataset of 1M+ images spanning 10k+ visual concepts to demonstrate webly supervised concept expansion for two existing GPVs, and proposes a new architecture, GPV-2, that supports a variety of tasks: from vision tasks like classification and localization, to vision+language tasks like QA and captioning, to more niche ones like human-object interaction detection.