• Corpus ID: 232320648

Multi-Modal Answer Validation for Knowledge-Based VQA

Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi
The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in various forms, including visual, textual, and commonsense knowledge. Using more knowledge sources increases the chance of retrieving more irrelevant or noisy facts, making it challenging to comprehend the facts and find the answer. To address this challenge, we propose Multi-modal Answer Validation using… 

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
This work proposes PICa, a simple yet effective method that prompts GPT-3 via image captions for knowledge-based VQA, treating GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge.
Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering
A Visual Retriever-Reader pipeline is proposed for knowledge-based VQA. It introduces various ways to retrieve knowledge using text and images, along with two reader styles (classification and extraction), and shows that a good retriever can significantly improve the reader's performance on the OK-VQA challenge.
Multimodal Few-Shot Learning with Frozen Language Models
The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings.
Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering
A unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models is proposed, showing that the text-only model outperforms pretrained multimodal models with a comparable number of parameters on a visual question answering task.
A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering
A Transform-Retrieve-Generate (TRiG) framework is proposed, which works in a plug-and-play fashion with alternative image-to-text models and textual knowledge bases, and outperforms all state-of-the-art supervised methods by an absolute margin of at least 11.1%.
Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection
This work empirically studies whether and how knowledge injection into pre-trained language models, specifically entity-enhanced knowledge graph embeddings, can improve an existing VQA system's performance on the KBVQA task, and provides insights into when entity knowledge injection helps improve a model's understanding.
K-LITE: Learning Transferable Visual Models with External Knowledge
This paper proposes K-LITE, a simple strategy to leverage external knowledge to build transferable visual systems, and proposes knowledge-augmented models that show signs of improvement in transfer learning performance over existing methods.
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
This paper proposes a novel model that captures relevant knowledge and incorporates it into the vision and semantic features via graph-based interaction and transformer-based fusion, and that outperforms the compared methods.
KAT: A Knowledge Augmented Transformer for Vision-and-Language
This work proposes a KAT model, which achieves a strong state-of-the-art result (+6% absolute) on the open-domain multimodal task of OK-VQA, and integrates implicit and explicit knowledge in an encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation.


Boosting Visual Question Answering with Context-aware Knowledge Aggregation
The proposed KG-Aug model is capable of retrieving context-aware knowledge subgraphs given visual images and textual questions, and learning to aggregate the useful image- and question-dependent knowledge which is then utilized to boost the accuracy in answering visual questions.
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
This work studies open-domain knowledge, the setting in which the knowledge required to answer a question is not given or annotated at either training or test time, and significantly outperforms the state of the art on OK-VQA, the largest available dataset for open-domain knowledge-based VQA.
FVQA: Fact-Based Visual Question Answering
A conventional visual question answering dataset, which contains image-question-answer triplets, is extended with additional image-question-answer-supporting-fact tuples, and a novel model is described that is capable of reasoning about an image on the basis of supporting facts.
ConceptBert: Concept-Aware Representation for Visual Question Answering
This work presents a concept-aware algorithm, ConceptBert, for questions which require common sense, or basic factual knowledge from external structured content, and introduces a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture.
Explicit Knowledge-based Reasoning for Visual Question Answering
A method for visual question answering is described that is capable of reasoning about the contents of an image on the basis of information extracted from a large-scale knowledge base, addressing one of the key issues in general visual question answering.
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
The injection of knowledge from general-purpose knowledge bases into vision-and-language transformers is investigated and it is shown that the injection of additional knowledge regularizes the space of embeddings, which improves the representation of lexical and semantic similarities.
PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text
PullNet is described, an integrated framework for learning what to retrieve and reasoning with this heterogeneous information to find the best answer in an open-domain question answering setting.
Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering
A learning-based approach that goes straight to the facts via a learned embedding space is developed, and state-of-the-art results are demonstrated on the challenging, recently introduced fact-based visual question answering dataset.
Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
This work develops an entity graph and uses a graph convolutional network to "reason" about the correct answer by jointly considering all entities, and shows that this leads to an improvement in accuracy of around 7% compared to the state of the art.