Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering

  • Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, Dacheng Tao
  • Published 10 August 2017
  • Computer Science
  • IEEE Transactions on Neural Networks and Learning Systems
Visual question answering (VQA) is challenging, because it requires a simultaneous understanding of both visual content of images and textual content of questions. To support the VQA task, we need to find good solutions for the following three issues: 1) fine-grained feature representations for both the image and the question; 2) multimodal feature fusion that is able to capture the complex interactions between multimodal features; and 3) automatic answer prediction that is able to consider the… 

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

A graph matching attention (GMA) network that builds a graph not only for the image but also for the question, using both syntactic and embedding information, and achieves state-of-the-art performance on the GQA and VQA 2.0 datasets.

Question-Agnostic Attention for Visual Question Answering

This paper proposes a question-agnostic attention (QAA) mechanism that is complementary to existing question-dependent attention mechanisms, and shows that incorporating complementary QAA allows state-of-the-art VQA models to perform better and provides a significant boost to simplistic VQA models, enabling them to perform on par with highly sophisticated fusion strategies.

Question Splitting and Unbalanced Multi-modal Pooling for VQA

A question splitting and unbalanced multi-modal pooling approach for visual question answering that introduces the co-attention mechanism and is superior to the previous models such as Oracle, SAN and QRU.

Deep Modular Bilinear Attention Network for Visual Question Answering

A deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) that construct inter-modality and intra-modality relations and can be cascaded in depth.

Second Order Enhanced Multi-glimpse Attention in Visual Question Answering

The idea of second-order interactions between different modalities, which is prevalent in recommendation systems, is repurposed for VQA to efficiently and explicitly model the second-order interaction on both the visual and textual features, learned in a shared embedding space.

Multi-Modality Global Fusion Attention Network for Visual Question Answering

A novel multi-modality global fusion attention network (MGFAN) is proposed, consisting of stacked global fusion attention (GFA) blocks that can capture information from global perspectives; it outperforms the previous state-of-the-art.

An Improved Attention for Visual Question Answering

This paper incorporates an Attention on Attention (AoA) module within an encoder-decoder framework, which is able to determine the relation between attention results and queries, and proposes a multimodal fusion module to combine both visual and textual information.

Multi-modal Feature Fusion Based on Variational Autoencoder for Visual Question Answering

The Variational Autoencoder (VAE) method was applied to calculate the probability distribution of the hidden variables of image and question text and a question feature hierarchy method was designed based on the traditional attention mechanism model and VAE to improve the accuracy of VQA tasks.

Multimodal Encoder-Decoder Attention Networks for Visual Question Answering

A novel Multimodal Encoder-Decoder Attention Network (MEDAN) is proposed that can capture rich and reasonable question features and image features by associating keywords in the question with important object regions in the image.



Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.

A Focused Dynamic Attention Model for Visual Question Answering

A novel Focused Dynamic Attention (FDA) model is proposed to provide better aligned image content representation with proposed questions and demonstrates the superior performance of FDA over well-established baselines on a large-scale benchmark dataset.

Hierarchical Question-Image Co-Attention for Visual Question Answering

This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

The Spatial Memory Network, a novel spatial attention architecture that aligns words with image patches in the first hop, is proposed and improved results are obtained compared to a strong deep baseline model which concatenates image and question features to predict the answer.

ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering

The proposed ABC-CNN architecture for the visual question answering (VQA) task achieves significant improvements over state-of-the-art methods on three benchmark VQA datasets and is shown to reflect the regions that are highly relevant to the questions.

Learning Convolutional Text Representations for Visual Question Answering

This work performs a detailed analysis of natural language questions in visual question answering, proposes to rely on convolutional neural networks for learning textual representations, and presents the "CNN Inception + Gate" model, which improves question representations and thus the overall accuracy of visual question answering models.

Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources

A method for visual question answering that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions, and is specifically able to answer questions posed in natural language that refer to information not contained in the image.

Multimodal Residual Learning for Visual QA

This work presents Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question answering, which extends the idea of deep residual learning.

Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images

We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.