Multi-Modal Answer Validation for Knowledge-Based VQA
@article{Wu2021MultiModalAV,
  title   = {Multi-Modal Answer Validation for Knowledge-Based VQA},
  author  = {Jialin Wu and Jiasen Lu and Ashish Sabharwal and Roozbeh Mottaghi},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2103.12248}
}
The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in various forms, including visual, textual, and commonsense knowledge. Using more knowledge sources increases the chance of retrieving more irrelevant or noisy facts, making it challenging to comprehend the facts and find the answer. To address this challenge, we propose Multi-modal Answer Validation using…
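Below is a minimal sketch of the answer-validation idea stated in the title and abstract: instead of searching all knowledge for one answer, a small set of candidate answers is scored by how well answer-specific retrieved facts support each one. All names here (retrieve_facts, support_score, the source dictionary) are illustrative placeholders, not the paper's actual implementation.

```python
# Hedged sketch: validate candidate answers against answer-specific retrieved facts.
from typing import Callable, Dict, List

def validate_candidates(
    question: str,
    candidates: List[str],
    sources: Dict[str, Callable[[str, str], List[str]]],
    support_score: Callable[[str, str, str], float],
) -> str:
    """Return the candidate whose retrieved facts best support it."""
    best_answer, best_score = None, float("-inf")
    for answer in candidates:
        score = 0.0
        for source_name, retrieve_facts in sources.items():
            # Retrieval is answer-specific: each candidate pulls its own facts.
            for fact in retrieve_facts(question, answer):
                score = max(score, support_score(question, answer, fact))
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer

# Toy usage with stub retrieval and scoring functions.
if __name__ == "__main__":
    sources = {"textual": lambda q, a: [f"{a} is a kind of fruit"]}
    scorer = lambda q, a, f: float(a in f)
    print(validate_candidates("What fruit is shown?", ["banana", "car"], sources, scorer))
```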
10 Citations
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
- Computer Science, ArXiv
- 2021
This work proposes PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions for knowledge-based VQA, treating GPT-3 as an implicit, unstructured KB that can jointly acquire and process relevant knowledge.
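A minimal sketch of this style of caption-based prompting follows: the image is replaced by an automatically generated caption, and a few-shot text prompt is assembled for a large language model. The prompt template is an assumption, and the actual GPT-3 API call is omitted.

```python
# Sketch: build a few-shot, caption-based prompt for knowledge-based VQA.
from typing import List, Tuple

def build_prompt(examples: List[Tuple[str, str, str]], caption: str, question: str) -> str:
    """Assemble a few-shot prompt from (caption, question, answer) triples."""
    lines = ["Please answer the question according to the context."]
    for ex_caption, ex_question, ex_answer in examples:
        lines.append(f"Context: {ex_caption}\nQ: {ex_question}\nA: {ex_answer}")
    lines.append(f"Context: {caption}\nQ: {question}\nA:")
    return "\n\n".join(lines)

if __name__ == "__main__":
    shots = [("A dog catching a frisbee in a park.", "What game is being played?", "fetch")]
    print(build_prompt(shots, "A man holding a surfboard on a beach.", "What sport is this?"))
    # The resulting string would then be sent to the language model (API call omitted).
```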
Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering
- Computer Science, EMNLP
- 2021
A Visual Retriever-Reader pipeline for knowledge-based VQA is introduced, with various ways to retrieve knowledge using text and images and two reader styles (classification and extraction); experiments show that a good retriever can significantly improve the reader's performance on the OK-VQA challenge.
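The sketch below illustrates the retriever-reader split, assuming a plain word-overlap retriever and a classification-style reader over a fixed answer vocabulary; it is illustrative only, not the paper's models.

```python
# Sketch: retrieve knowledge passages, then read them to pick an answer.
from typing import List

def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank knowledge passages by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(passages, key=lambda p: len(q_words & set(p.lower().split())), reverse=True)
    return scored[:k]

def read(question: str, passages: List[str], answer_vocab: List[str]) -> str:
    """Classification-style reader: pick the vocabulary answer best supported by the evidence."""
    evidence = " ".join(passages).lower()
    return max(answer_vocab, key=lambda a: evidence.count(a.lower()))

if __name__ == "__main__":
    kb = ["Penguins are flightless birds that live in Antarctica.",
          "Parrots are colorful birds that can mimic speech."]
    top = retrieve("What bird in the photo cannot fly?", kb)
    print(read("What bird in the photo cannot fly?", top, ["penguin", "parrot"]))
```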
Multimodal Few-Shot Learning with Frozen Language Models
- Computer Science, NeurIPS
- 2021
The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings.
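A schematic sketch of this conditioning idea follows: image features are projected into a few prefix embeddings, prepended to the text embeddings, and passed through a language model whose weights stay frozen. The tiny GRU below only stands in for a real pretrained LM; all dimensions and names are illustrative assumptions.

```python
# Sketch: condition a frozen language model on an image via trainable prefix embeddings.
import torch
import torch.nn as nn

class FrozenPrefixLM(nn.Module):
    def __init__(self, vocab_size=100, d_model=32, img_dim=64, prefix_len=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)    # frozen
        self.lm = nn.GRU(d_model, d_model, batch_first=True)  # frozen stand-in for the LM
        self.head = nn.Linear(d_model, vocab_size)            # frozen
        # Only the vision-to-prefix projection is trained.
        self.img_to_prefix = nn.Linear(img_dim, prefix_len * d_model)
        self.prefix_len, self.d_model = prefix_len, d_model
        for module in (self.token_emb, self.lm, self.head):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, img_feat, token_ids):
        prefix = self.img_to_prefix(img_feat).view(-1, self.prefix_len, self.d_model)
        text = self.token_emb(token_ids)
        hidden, _ = self.lm(torch.cat([prefix, text], dim=1))
        return self.head(hidden[:, -1])  # next-token logits conditioned on image and text

if __name__ == "__main__":
    model = FrozenPrefixLM()
    logits = model(torch.randn(1, 64), torch.randint(0, 100, (1, 5)))
    print(logits.shape)  # torch.Size([1, 100])
```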
Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering
- Computer Science, ArXiv
- 2021
A unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models is proposed, showing that the text-only model outperforms pretrained multimodal models with a comparable number of parameters on a visual question answering task.
A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering
- Computer Science
- 2022
A Transform-Retrieve-Generate (TRiG) framework is proposed, which can be used plug-and-play with alternative image-to-text models and textual knowledge bases, and which outperforms all state-of-the-art supervised methods by at least an 11.1% absolute margin.
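Below is a rough sketch of such a transform-retrieve-generate flow: (1) transform the image into text, (2) retrieve related passages from a textual knowledge base, (3) generate an answer from the combined text. The captioner, retriever, and generator are user-supplied placeholders, not TRiG's actual components.

```python
# Sketch: image -> text, text -> knowledge, combined text -> generated answer.
from typing import Callable, List

def transform_retrieve_generate(
    image,                      # any image handle the captioner understands
    question: str,
    image_to_text: Callable[[object], List[str]],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    visual_text = image_to_text(image)                            # step 1: image -> text
    passages = retrieve(question + " " + " ".join(visual_text))   # step 2: text -> knowledge
    context = "\n".join(visual_text + passages)
    return generate(f"{context}\nQuestion: {question}\nAnswer:")  # step 3: text -> answer

if __name__ == "__main__":
    answer = transform_retrieve_generate(
        image=None,
        question="What country is this landmark in?",
        image_to_text=lambda img: ["A photo of the Eiffel Tower at night."],
        retrieve=lambda q: ["The Eiffel Tower is located in Paris, France."],
        generate=lambda prompt: "France",
    )
    print(answer)
```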
Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection
- Computer Science, ArXiv
- 2021
This work empirically studies how and whether knowledge injection into pre-trained language models, specifically entity-enhanced knowledge graph embeddings, can improve an existing VQA system's performance on the KBVQA task, and provides insights into when entity knowledge injection helps improve a model's understanding.
K-LITE: Learning Transferable Visual Models with External Knowledge
- Computer Science, ArXiv
- 2022
This paper proposes K-LITE, a simple strategy that leverages external knowledge to build transferable visual systems; the resulting knowledge-augmented models show signs of improvement in transfer-learning performance over existing methods.
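A hedged sketch of the external-knowledge idea summarized above follows: class names used in image-text matching are enriched with short dictionary or encyclopedia definitions before being encoded as text prompts. The lookup table and prompt template are assumptions for illustration.

```python
# Sketch: augment a class-name prompt with an external-knowledge definition.
from typing import Dict

def knowledge_augmented_prompt(class_name: str, knowledge: Dict[str, str]) -> str:
    definition = knowledge.get(class_name, "")
    prompt = f"a photo of a {class_name}"
    return f"{prompt}, {definition}" if definition else prompt

if __name__ == "__main__":
    wordnet_like = {"marmoset": "a small monkey with a long tail, native to South America"}
    print(knowledge_augmented_prompt("marmoset", wordnet_like))
    # Prompts like this would be fed to the text encoder of a CLIP-style model.
```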
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
- Computer Science, ArXiv
- 2022
This paper proposes a novel model that captures relevant knowledge and incorporates it into the vision and semantic features via graph-based interaction and transformer-based fusion, and that effectively outperforms competing methods.
Information retrieval and question answering: A case study on COVID-19 scientific literature
- Computer Science, Knowledge-Based Systems
- 2021
KAT: A Knowledge Augmented Transformer for Vision-and-Language
- Computer Science, ArXiv
- 2021
This work proposes KAT, a model that integrates implicit and explicit knowledge in an encoder-decoder architecture while jointly reasoning over both knowledge sources during answer generation, and that achieves a strong state-of-the-art result (+6% absolute) on the open-domain multimodal task of OK-VQA.
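The sketch below shows one loose reading of that input construction: explicit knowledge (retrieved passages) and implicit knowledge (e.g., candidates produced by a large language model) are packed into a single sequence so an encoder-decoder can reason over both while generating the answer. Field names and the separator are assumptions, not KAT's exact format.

```python
# Sketch: pack explicit and implicit knowledge into one encoder input string.
from typing import List

def build_encoder_input(question: str, explicit: List[str], implicit: List[str]) -> str:
    parts = [f"question: {question}"]
    parts += [f"explicit knowledge: {p}" for p in explicit]
    parts += [f"implicit knowledge: {p}" for p in implicit]
    return " | ".join(parts)  # fed to a seq2seq model that decodes the answer

if __name__ == "__main__":
    print(build_encoder_input(
        "What is the capital of the country shown on the flag?",
        ["The flag of France is blue, white, and red."],
        ["candidate answer: Paris, because the flag appears to be French"],
    ))
```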
References
Boosting Visual Question Answering with Context-aware Knowledge Aggregation
- Computer Science, ACM Multimedia
- 2020
The proposed KG-Aug model retrieves context-aware knowledge subgraphs given a visual image and a textual question, and learns to aggregate the useful image- and question-dependent knowledge, which is then used to boost accuracy in answering visual questions.
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This work studies open-domain knowledge, the setting in which the knowledge required to answer a question is not given or annotated at either training or test time, and significantly outperforms the state of the art on OK-VQA, the largest available dataset for open-domain knowledge-based VQA.
FVQA: Fact-Based Visual Question Answering
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2018
A conventional visual question answering dataset containing image-question-answer triplets is extended with additional image-question-answer-supporting-fact tuples, and a novel model is described which is capable of reasoning about an image on the basis of supporting facts.
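A small illustration of that extended annotation format follows: each sample pairs an image-question-answer triplet with a supporting fact. The dataclass and field names below are ours for illustration, not the dataset's exact schema.

```python
# Sketch: an image-question-answer-supporting-fact tuple as a simple dataclass.
from dataclasses import dataclass

@dataclass
class FactBasedVQASample:
    image_id: str
    question: str
    answer: str
    supporting_fact: tuple  # e.g., a (subject, relation, object) triple

sample = FactBasedVQASample(
    image_id="img_0001",
    question="Which animal in the image can be ridden?",
    answer="horse",
    supporting_fact=("horse", "CapableOf", "being ridden"),
)
print(sample.supporting_fact)
```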
ConceptBert: Concept-Aware Representation for Visual Question Answering
- Computer Science, FINDINGS
- 2020
This work presents a concept-aware algorithm, ConceptBert, for questions which require common sense or basic factual knowledge from external structured content, and introduces a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture.
Explicit Knowledge-based Reasoning for Visual Question Answering
- Computer Science, IJCAI
- 2017
A method for visual question answering which is capable of reasoning about the contents of an image on the basis of information extracted from a large-scale knowledge base is described, addressing one of the key issues in general visual question answering.
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
- Computer Science, LANTERN
- 2021
The injection of knowledge from general-purpose knowledge bases into vision-and-language transformers is investigated, and it is shown that injecting additional knowledge regularizes the embedding space, which improves the representation of lexical and semantic similarities.
PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text
- Computer Science, EMNLP
- 2019
PullNet is described, an integrated framework for learning what to retrieve and reasoning with this heterogeneous information to find the best answer in an open-domain question answering setting.
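The sketch below gives a rough picture of iterative retrieval in the spirit of that summary: starting from the question's entities, repeatedly "pull" facts about the current frontier, add newly mentioned entities, and answer from the accumulated subgraph. The pull_facts callable and hop count are placeholders for illustration.

```python
# Sketch: iteratively expand a question subgraph by pulling facts for frontier entities.
from typing import Callable, List, Set, Tuple

def iterative_retrieve(
    seed_entities: Set[str],
    pull_facts: Callable[[str], List[Tuple[str, str, str]]],
    hops: int = 2,
) -> Set[Tuple[str, str, str]]:
    subgraph, frontier = set(), set(seed_entities)
    for _ in range(hops):
        new_frontier = set()
        for entity in frontier:
            for subj, rel, obj in pull_facts(entity):
                subgraph.add((subj, rel, obj))
                new_frontier.update({subj, obj} - frontier - seed_entities)
        frontier = new_frontier
    return subgraph  # a reader / answer-selection model would operate on this subgraph

if __name__ == "__main__":
    kb = {"Paris": [("Paris", "capital_of", "France")],
          "France": [("France", "located_in", "Europe")]}
    print(iterative_retrieve({"Paris"}, lambda e: kb.get(e, [])))
```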
Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering
- Computer Science, ECCV
- 2018
A learning-based approach which goes straight to the facts via a learned embedding space is developed, and state-of-the-art results are demonstrated on the challenging, recently introduced fact-based visual question answering dataset.
Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
- Computer Science, NeurIPS
- 2018
This work develops an entity graph and uses a graph convolutional network to 'reason' about the correct answer by jointly considering all entities, and shows that this leads to an improvement in accuracy of around 7% compared to the state of the art.
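A minimal graph-convolution sketch of that idea follows: entities extracted from the image, question, and facts form a graph, a GCN updates entity features jointly, and each entity is scored as a candidate answer. The sizes and scoring head below are illustrative assumptions, not the paper's architecture.

```python
# Sketch: one graph-convolution layer over an entity graph, with per-entity answer scores.
import torch
import torch.nn as nn

class EntityGCN(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, node_feats, adj):
        # Normalize the adjacency (with self-loops) and propagate neighbor information.
        adj = adj + torch.eye(adj.size(0))
        deg = adj.sum(dim=1, keepdim=True)
        hidden = torch.relu(self.linear((adj / deg) @ node_feats))
        return self.score(hidden).squeeze(-1)  # one answer score per entity

if __name__ == "__main__":
    feats = torch.randn(4, 16)  # 4 entities
    adj = torch.tensor([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]])
    print(EntityGCN()(feats, adj))  # scores over the 4 entities
```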