Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

J. Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, Jianlong Tan

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering

This paper proposes MuKEA to represent multimodal knowledge by an explicit triplet to correlate visual objects and fact answers with implicit relations and proposes three objective losses to learn the triplet representations from complementary views: embedding structure, topological relation and semantic space.

Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering

This paper proposes a novel reasoning model, a question-guided tree structure with a knowledge base (QGTSKB), to address collaborative reasoning in knowledge-based visual question answering, and achieves superior performance over existing methods on the VQA v2.0 and CLEVR datasets.

Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering

A novel model named dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR), which performs explicit and implicit reasoning over a key-value knowledge memory module and a spatial-aware image graph, respectively, achieves new state-of-the-art accuracy on the KRVQR and FVQA datasets.

From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering

Zihao Zhu. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

A Hierarchical Graph Neural Module Network (HGNMN) that reasons over multi-layer graphs with neural modules and achieves state-of-the-art performance on the CRIC dataset.

Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

This work identifies a key structural idiom in OKVQA, viz.

SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering

This work converts entities from different modalities into sequential nodes and an adjacency graph, then incorporates them for structured alignment, which works with graph representations of visual and textual content to capture the deep connections between the visual and textual modalities.

Knowledge is Power: Hierarchical-Knowledge Embedded Meta-Learning for Visual Reasoning in Artistic Domains

This paper presents a deep relational model to capture and memorize the relations among different samples and provides the hierarchical-knowledge embedding that mines the implicit relationship between question-answer pairs for knowledge representation as the guidance of the meta-learner.

Visual Question Answering using Deep Learning: A Survey and Performance Analysis

This survey covers recent datasets released in the VQA domain, which deal with various question formats and support the robustness of machine-learning models, and presents and discusses results the authors computed for vanilla VQA models, the Stacked Attention Network, and the VQA Challenge 2017 winner model.



Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

A novel framework is proposed that enables the model to answer more complex questions by leveraging massive external knowledge with dynamic memory networks, and that can also answer open-domain questions effectively by leveraging that external knowledge.

Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

This work develops an entity graph and uses a graph convolutional network to 'reason' about the correct answer by jointly considering all entities, and shows that this leads to an improvement in accuracy of around 7% compared to the state of the art.

FVQA: Fact-Based Visual Question Answering

A conventional visual question answering dataset, which contains image-question-answer triplets, is extended with additional image-question-answer-supporting-fact tuples, and a novel model is described that is capable of reasoning about an image on the basis of supporting facts.

Explicit Knowledge-based Reasoning for Visual Question Answering

A method for visual question answering is described that is capable of reasoning about the contents of an image on the basis of information extracted from a large-scale knowledge base, addressing one of the key issues in general visual question answering.

Visual Question Answering as Reading Comprehension

This paper proposes to unify all the input information through natural language so as to convert VQA into a machine reading comprehension problem, which is a step towards exploiting large volumes of text and natural language processing techniques to address the VQA problem.

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.

Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering

A learning-based approach that goes straight to the facts via a learned embedding space is developed, and state-of-the-art results are demonstrated on the challenging, recently introduced fact-based visual question answering dataset.

Learning Conditioned Graph Structures for Interpretable Visual Question Answering

This paper proposes a novel graph-based approach for Visual Question Answering that combines a graph learner module, which learns a question-specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question-specific interactions.

Relation-Aware Graph Attention Network for Visual Question Answering

A Relation-aware Graph Attention Network (ReGAT) is proposed, which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism to learn question-adaptive relation representations.

Chain of Reasoning for Visual Question Answering

A chain of reasoning (CoR) is constructed for supporting multi-step and dynamic reasoning on changed relations and objects and achieves new state-of-the-art results on four publicly available datasets.