Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server…
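The division of labor described in the abstract can be sketched in a few lines: a fixed set of bottom-up region features (e.g., from Faster R-CNN) is re-weighted by a top-down attention score conditioned on a task context such as the question or the caption state. This is a minimal NumPy illustration of the common additive-attention formulation, not the paper's exact implementation; all parameter names, dimensions, and the random weights are illustrative.

```python
import numpy as np

def top_down_attention(regions, query, w_r, w_q, w_a):
    """Soft top-down attention over bottom-up image region features.

    regions: (k, d) array, one feature vector per proposed region
    query:   (q,) task context vector (e.g., question encoding)
    w_r: (d, h), w_q: (q, h), w_a: (h,) -- learned projections (random here)
    Returns the attended image feature (d,) and the region weights (k,).
    """
    # Unnormalized score per region: a_i = w_a . tanh(W_r v_i + W_q q)
    scores = np.tanh(regions @ w_r + query @ w_q) @ w_a   # shape (k,)
    # Softmax over regions yields the top-down feature weighting
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Attended feature is the convex combination of region features
    attended = weights @ regions                           # shape (d,)
    return attended, weights

# Toy usage: 36 regions of dimension 2048, as in the paper's common setup
rng = np.random.default_rng(0)
k, d, q, h = 36, 2048, 512, 128
regions = rng.standard_normal((k, d))
query = rng.standard_normal(q)
w_r = rng.standard_normal((d, h))
w_q = rng.standard_normal((q, h))
w_a = rng.standard_normal(h)
attended, weights = top_down_attention(regions, query, w_r, w_q, w_a)
```

In practice the projections are trained end to end with the captioning or VQA decoder; the key point is that the set of candidate regions is fixed by the bottom-up detector, and only their weighting is computed top-down.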

Cascading Top-Down Attention for Visual Question Answering

A Cascading Top-Down Attention (CTDA) model is proposed, which highlights the most important information collected from images and questions through a cascading attention process and obtains better results than the standard TDA and other state-of-the-art models.

Bottom-up and Top-down Object Inference Networks for Image Captioning

This work presents Bottom-up and Top-down Object inference Networks (BTO-Net), which exploit the object sequence of interest as top-down signals to guide image captioning, obtaining competitive performance on the COCO benchmark.

In Defense of Grid Features for Visual Question Answering

This paper revisits grid features for VQA and finds they can work surprisingly well -- running more than an order of magnitude faster at the same accuracy (e.g., when pre-trained in a similar fashion).

Multi-stage Attention based Visual Question Answering

This work proposes an alternating bi-directional attention framework that helps both modalities and leads to better representations for the VQA task; it is benchmarked on the TDIUC dataset against state-of-the-art approaches.

Gated Hierarchical Attention for Image Captioning

This paper proposes a bottom-up gated hierarchical attention (GHA) mechanism for image captioning in which low-level concepts are merged into high-level concepts while, simultaneously, low-level attended features are passed to the top to make predictions.

Question Type Guided Attention in Visual Question Answering

This work proposes Question Type-guided Attention (QTA), which utilizes the information of question type to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks.

Re-Attention for Visual Question Answering

A re-attention framework that utilizes the information in answers for the VQA task: it first learns initial attention weights for the objects by computing the similarity of each word-object pair in the feature space, and then introduces a gate mechanism that automatically controls the contribution of re-attention to model training based on the entropy of the learned initial visual attention maps.

Co-Attention Network With Question Type for Visual Question Answering

A new network architecture combining the proposed co-attention mechanism and question type provides a unified model for VQA; its effectiveness is demonstrated in comparison with several state-of-the-art approaches.

A Bottom-Up and Top-Down Approach for Image Captioning using Transformer

Two novel approaches, a top-down and a bottom-up approach, are proposed independently; both dispense with recurrence entirely by incorporating a Transformer, a network architecture for generating sequences that relies entirely on the attention mechanism.

Residual Self-Attention for Visual Question Answering

A multi-stage attention model is put forward: bottom-up attention and residual self-attention are applied to the image itself, and a question-guided double-headed soft (top-down) attention method is used to extract the image features.

Hierarchical Question-Image Co-Attention for Visual Question Answering

This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).

Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.

Areas of Attention for Image Captioning

The attention mechanism and spatial transformer attention areas together yield state-of-the-art results on the MSCOCO dataset.

Visual7W: Grounded Question Answering in Images

A semantic link between textual descriptions and image regions by object-level grounding enables a new type of QA with visual answers, in addition to textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.

What Value Do Explicit High Level Concepts Have in Vision to Language Problems?

A method of incorporating high-level concepts into the successful CNN-RNN approach is proposed, and it is shown that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.

Aligning where to see and what to tell: image caption with region-based attention and scene factorization

This paper proposes an image caption system that exploits the parallel structures between images and sentences and makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image.

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

The Spatial Memory Network, a novel spatial attention architecture that aligns words with image patches in the first hop, is proposed; it obtains improved results compared to a strong deep baseline model that concatenates image and question features to predict the answer.

Revisiting Visual Question Answering Baselines

The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.

Image Captioning with Semantic Attention

This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering

This model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both the unbalanced and balanced VQA benchmarks.