OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Visual Question Answering (VQA) in its ideal form lets us study reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. Our analysis shows that our knowledge-based VQA task is diverse, difficult, and large compared to previous knowledge-based VQA datasets. We hope that this dataset enables researchers to open up new avenues for research in this domain.

A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge

This work introduces A-OKVQA, a crowdsourced dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer, and demonstrates the potential of this new dataset through a detailed analysis of its contents and baseline performance measurements over a variety of state-of-the-art vision–language models.

AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

This paper proposes Agent Interaction Visual Question Answering (AI-VQA), a task that investigates deep scene understanding of what happens if an agent takes a certain action, and proposes a novel method, called ARE, that comprehends the interaction and explains the reasoning based on a given event knowledge base.

ConceptBert: Concept-Aware Representation for Visual Question Answering

This work presents a concept-aware algorithm, ConceptBert, for questions which require common sense, or basic factual knowledge from external structured content, and introduces a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture.

Evaluating State-of-the-Art Visual Question Answering Models Ability to Answer Complex Counting Questions

This paper incorporates the four basic mathematical operations into the ‘counting’ questions of the CLEVR dataset and compares how different models fare on this modified dataset of 100,000 images and 2.4 million questions, opening new pathways for future work.

A Dataset and Baselines for Visual Question Answering on Art

This work introduces the first attempt towards building a new dataset, coined AQUA (Art QUestion Answering), where question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods based on paintings and comments provided in an existing art understanding dataset.

Can Open Domain Question Answering Systems Answer Visual Knowledge Questions?

This work proposes a potentially data-efficient approach that reuses existing systems for image analysis, question rewriting, and text-based question answering to answer many visual questions, and explores two rewriting strategies that combine adaptive rewriting and reinforcement learning techniques to use the implicit feedback from the QA system.

Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding

A novel dataset named Knowledge-Routed Visual Question Reasoning is proposed, which aims to cut off the shortcut learning exploited by current deep embedding models and push the research boundary of knowledge-based visual question reasoning.

Coarse-to-Fine Reasoning for Visual Question Answering

This paper proposes a new reasoning framework to fill the gap between visual features and semantic clues in the VQA task, achieving superior accuracy compared with other state-of-the-art methods.

FVQA: Fact-Based Visual Question Answering

A conventional visual question answering dataset, which contains image-question-answer triplets, is extended with additional image-question-answer-supporting-fact tuples, and a novel model is described which is capable of reasoning about an image on the basis of supporting facts.

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with these new tasks.

Visual7W: Grounded Question Answering in Images

This work establishes a semantic link between textual descriptions and image regions by object-level grounding, which enables a new type of QA with visual answers in addition to the textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Hierarchical Question-Image Co-Attention for Visual Question Answering

This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).

Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

This work develops an entity graph and uses a graph convolutional network to 'reason' about the correct answer by jointly considering all entities, and shows that this leads to an improvement in accuracy of around 7% over the state of the art.

Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images

We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose

Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

A novel framework is proposed which endows the model with the capability to answer more complex questions by leveraging massive external knowledge via dynamic memory networks, and which can also answer open-domain questions effectively.

Building a Large-scale Multimodal Knowledge Base for Visual Question Answering

This work builds a multimodal knowledge base (KB) incorporating visual, textual and structured data, as well as their diverse relations for visual QA, and introduces a scalable knowledge base construction system by leveraging database techniques.

Explicit Knowledge-based Reasoning for Visual Question Answering

A method for visual question answering is described which is capable of reasoning about the contents of an image on the basis of information extracted from a large-scale knowledge base, addressing one of the key issues in general visual question answering.