Corpus ID: 868693

Hierarchical Question-Image Co-Attention for Visual Question Answering

@article{Lu2016HierarchicalQC,
  title={Hierarchical Question-Image Co-Attention for Visual Question Answering},
  author={Jiasen Lu and Jianwei Yang and Dhruv Batra and Devi Parikh},
  journal={ArXiv},
  year={2016},
  volume={abs/1606.00061}
}
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the… 
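The abstract describes the co-attention mechanism only at a high level, so the following is a minimal PyTorch sketch of a parallel co-attention step in that spirit: an affinity matrix relates question tokens to image regions, and attention weights are then computed over both modalities. Module, layer, and dimension names (ParallelCoAttention, W_b, W_v, W_q, d, k) are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Minimal sketch of a parallel co-attention step (illustrative names)."""

    def __init__(self, d: int, k: int):
        super().__init__()
        self.W_b = nn.Linear(d, d, bias=False)   # affinity transform
        self.W_v = nn.Linear(d, k, bias=False)   # image branch
        self.W_q = nn.Linear(d, k, bias=False)   # question branch
        self.w_hv = nn.Linear(k, 1, bias=False)  # image attention scores
        self.w_hq = nn.Linear(k, 1, bias=False)  # question attention scores

    def forward(self, V: torch.Tensor, Q: torch.Tensor):
        # V: (B, N, d) image region features, Q: (B, T, d) question word features.
        # Affinity matrix C relates every question token to every image region.
        C = torch.tanh(torch.bmm(Q, self.W_b(V).transpose(1, 2)))  # (B, T, N)

        # Each modality's attention is conditioned on the other via C.
        H_v = torch.tanh(self.W_v(V) + torch.bmm(C.transpose(1, 2), self.W_q(Q)))  # (B, N, k)
        H_q = torch.tanh(self.W_q(Q) + torch.bmm(C, self.W_v(V)))                  # (B, T, k)

        a_v = F.softmax(self.w_hv(H_v), dim=1)  # (B, N, 1): "where to look"
        a_q = F.softmax(self.w_hq(H_q), dim=1)  # (B, T, 1): "what words to listen to"

        v_att = (a_v * V).sum(dim=1)  # (B, d) attended image feature
        q_att = (a_q * Q).sum(dim=1)  # (B, d) attended question feature
        return v_att, q_att, a_v, a_q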

Citations

Co-Attention Network With Question Type for Visual Question Answering

A new network architecture combining the proposed co-attention mechanism with question type information provides a unified model for VQA, and experiments demonstrate its effectiveness compared with several state-of-the-art approaches.

Structured Attentions for Visual Question Answering

This paper proposes to model the visual attention as a multivariate distribution over a grid-structured Conditional Random Field on image regions, and demonstrates how to convert the iterative inference algorithms, Mean Field and Loopy Belief Propagation, into recurrent layers of an end-to-end neural network.

Multimodal Attention in Recurrent Neural Networks for Visual Question Answering

This paper presents a novel LSTM architecture for VQA that uses multimodal attention to focus over specific parts of the image and also on specific question words to generate the answer.

Re-Attention for Visual Question Answering

A re-attention framework that utilizes the information in answers for the VQA task: initial attention weights over the objects are first learned by computing the similarity of each word-object pair in feature space, and a gate mechanism then automatically controls the contribution of re-attention to model training based on the entropy of the learned initial visual attention maps.
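As a rough, assumption-laden sketch of that idea, word-object similarity can yield per-word attention maps whose entropy drives a gate; the exact formulation in the cited paper may differ, and all names below are hypothetical.

import torch
import torch.nn.functional as F

def initial_object_attention(words, objects):
    """words: (B, T, d) question word features; objects: (B, N, d) object features.

    Both the dot-product similarity and the entropy-based gate are illustrative
    assumptions, not the cited paper's exact formulation.
    """
    sim = torch.bmm(words, objects.transpose(1, 2))           # (B, T, N) word-object similarity
    attn = F.softmax(sim, dim=-1)                             # attention over objects per word
    # Entropy of each word's attention map; low entropy = confident attention.
    entropy = -(attn * attn.clamp_min(1e-8).log()).sum(-1)    # (B, T)
    # Gate that down-weights the re-attention contribution when attention is diffuse.
    gate = torch.exp(-entropy)                                 # (B, T), in (0, 1]
    return attn, gate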

Stacked Self-Attention Networks for Visual Question Answering

A VQA model is proposed that utilizes stacked self-attention for visual understanding together with a BERT-based question embedding, enabling the model to focus not only on individual objects but also on the relations between objects.

Multi-stage Attention based Visual Question Answering

This work proposes an alternating bi-directional attention framework that benefits both modalities and leads to better representations for the VQA task, and is benchmarked on the TDIUC dataset against state-of-the-art approaches.

Multimodal Attention for Visual Question Answering

  • L. Kodra, E. Meçe
  • Computer Science
    Advances in Intelligent Systems and Computing
  • 2018
This paper presents a novel LSTM architecture for VQA that uses multimodal attention to focus over specific parts of the image and also on specific words of the question in order to generate a more precise answer.

Multi-Modality Global Fusion Attention Network for Visual Question Answering

A novel multi-modality global fusion attention network (MGFAN) consisting of stacked global fusion attention (GFA) blocks, which capture information from global perspectives, is proposed and shown to outperform the previous state-of-the-art.

An Improved Attention for Visual Question Answering

This paper incorporates an Attention on Attention (AoA) module within an encoder-decoder framework, which is able to determine the relation between attention results and queries, and proposes a multimodal fusion module to combine both visual and textual information.
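A small sketch of what an Attention-on-Attention style gate can look like, assuming it is computed from the concatenation of the query and the attention result; layer names and sizes are illustrative.

import torch
import torch.nn as nn

class AttentionOnAttention(nn.Module):
    """Gate an attention result by its relevance to the query (illustrative sketch)."""

    def __init__(self, d: int):
        super().__init__()
        self.info = nn.Linear(2 * d, d)
        self.gate = nn.Linear(2 * d, d)

    def forward(self, query: torch.Tensor, attended: torch.Tensor) -> torch.Tensor:
        x = torch.cat([query, attended], dim=-1)  # (B, 2d)
        i = self.info(x)                          # candidate information vector
        g = torch.sigmoid(self.gate(x))           # relevance gate
        return g * i                              # gated attention output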

From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

A Cubic Visual Attention (CVA) model is proposed that applies novel channel and spatial attention over object regions to improve the VQA task; experimental results show that the proposed method significantly outperforms the state of the art.
...

References

SHOWING 1-10 OF 32 REFERENCES

A Focused Dynamic Attention Model for Visual Question Answering

A novel Focused Dynamic Attention (FDA) model is proposed to provide better aligned image content representation with proposed questions and demonstrates the superior performance of FDA over well-established baselines on a large-scale benchmark dataset.

Visual7W: Grounded Question Answering in Images

Object-level grounding provides a semantic link between textual descriptions and image regions, enabling a new type of QA with visual answers in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.

Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?

The VQA-HAT (Human ATtention) dataset is introduced, and attention maps generated by state-of-the-art VQA models are evaluated against human attention both qualitatively and quantitatively.

Stacked Attention Networks for Image Question Answering

A multiple-layer SAN is developed in which an image is queried multiple times to infer the answer progressively, and the SAN is shown to locate, layer by layer, the relevant visual clues that lead to the answer of the question.
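A minimal sketch of such multi-hop (stacked) attention, in which the question vector is refined by the attended image feature and used to query the image again; the names and the number of hops are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttention(nn.Module):
    """Query the image repeatedly, refining the query each hop (illustrative sketch)."""

    def __init__(self, d: int, k: int, hops: int = 2):
        super().__init__()
        self.hops = hops
        self.proj_v = nn.Linear(d, k)
        self.proj_u = nn.Linear(d, k)
        self.score = nn.Linear(k, 1)

    def forward(self, V: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
        # V: (B, N, d) image region features, u: (B, d) question vector.
        for _ in range(self.hops):
            h = torch.tanh(self.proj_v(V) + self.proj_u(u).unsqueeze(1))  # (B, N, k)
            p = F.softmax(self.score(h), dim=1)                           # (B, N, 1)
            v_tilde = (p * V).sum(dim=1)                                  # (B, d) attended image feature
            u = u + v_tilde  # refine the query with the attended visual clue
        return u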

Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images

We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation to this problem for which all parts are trained jointly.

Exploring Models and Data for Image Question Answering

This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.

Learning to Answer Questions from Image Using Convolutional Neural Network

The proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations, but also their inter-modal interactions to produce the answer.

Yin and Yang: Balancing and Answering Binary Visual Questions

This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired in the questions by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
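Compact bilinear pooling approximates the outer product of the visual and question feature vectors by count-sketching each vector and convolving the sketches in the frequency domain. The sketch below illustrates this under assumed names (count_sketch, mcb_pool, out_dim); the hash indices and ±1 signs would be sampled once (e.g. torch.randint(0, out_dim, (d,)) for indices) and kept fixed.

import torch

def count_sketch(x, h, s, out_dim):
    """Project x (B, d) to out_dim buckets.

    h: (d,) long tensor of fixed random bucket indices; s: (d,) float tensor of
    fixed random +/-1 signs. Both are illustrative assumptions.
    """
    sketch = x.new_zeros(x.size(0), out_dim)
    sketch.index_add_(1, h, x * s)  # scatter signed features into hash buckets
    return sketch

def mcb_pool(v, q, h_v, s_v, h_q, s_q, out_dim=16000):
    """Approximate bilinear pooling of v and q via count sketches (illustrative)."""
    sv = count_sketch(v, h_v, s_v, out_dim)
    sq = count_sketch(q, h_q, s_q, out_dim)
    # Circular convolution of the sketches == element-wise product in the Fourier domain.
    fused = torch.fft.irfft(torch.fft.rfft(sv) * torch.fft.rfft(sq), n=out_dim)
    return fused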

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.