Corpus ID: 245124118

Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection

Diego Garcia-Olano, Yasumasa Onoe, Joydeep Ghosh
Knowledge-Based Visual Question Answering (KBVQA) is a bimodal task that requires external world knowledge to correctly answer a text question about an associated image. Recent work on the single text modality has shown that knowledge injection into pre-trained language models, specifically via entity-enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks. In this work, we empirically study how and whether such methods, applied in a bi-modal setting, can improve an…



Boosting Visual Question Answering with Context-aware Knowledge Aggregation
The proposed KG-Aug model retrieves context-aware knowledge subgraphs given visual images and textual questions, and learns to aggregate the useful image- and question-dependent knowledge, which is then used to boost accuracy in answering visual questions.
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
This work proposes PICa, a simple yet effective method that prompts GPT-3 via the use of image captions for knowledge-based VQA, treating GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge.
ConceptBert: Concept-Aware Representation for Visual Question Answering
This work presents a concept-aware algorithm, ConceptBert, for questions which require common sense, or basic factual knowledge from external structured content, and introduces a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture.
KVQA: Knowledge-Aware Visual Question Answering
KVQA is introduced, the first dataset for the task of (world) knowledge-aware VQA and the largest dataset for exploring VQA over large Knowledge Graphs (KG); it consists of 183K question-answer pairs involving more than 18K named entities and 24K images.
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
The injection of knowledge from general-purpose knowledge bases into vision-and-language transformers is investigated and it is shown that the injection of additional knowledge regularizes the space of embeddings, which improves the representation of lexical and semantic similarities.
FVQA: Fact-Based Visual Question Answering
A conventional visual question answering dataset, which contains image-question-answer triplets, is extended with additional image-question-answer-supporting-fact tuples, and a novel model is described that is capable of reasoning about an image on the basis of supporting facts.
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.
From Two Graphs to N Questions: A VQA Dataset for Compositional Reasoning on Vision and Commonsense
This paper presents a new compositional model that is capable of implementing various types of reasoning functions on the image content and the knowledge graph and develops a powerful method to automatically generate compositional questions and rich annotations from both the scene graph of a given image and some external knowledge graph.
From Strings to Things: Knowledge-Enabled VQA Model That Can Read and Reason
This work presents a VQA model that can read scene text and perform reasoning over a knowledge graph to arrive at an accurate answer, and introduces the first dataset that identifies the need for bridging text recognition with knowledge-graph-based reasoning.
Language Models as Knowledge Bases?
An in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models finds that BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge.