Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
@article{Anderson2017BottomUpAT, title={Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering}, author={Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang}, journal={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition}, year={2018}, pages={6077-6086} }
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server…
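The weighting step of the key method lends itself to a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' released implementation: k bottom-up region features (e.g., from Faster R-CNN) are scored against a top-down context vector (e.g., an LSTM hidden state), and a softmax over the scores yields one attended image feature. All layer sizes, tensor shapes, and the single-layer scoring network are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Soft top-down attention over bottom-up region features (sketch)."""

    def __init__(self, feat_dim=2048, ctx_dim=1024, hidden_dim=512):
        super().__init__()
        # Illustrative sizes: 2048-d region features, 1024-d context vector.
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)
        self.proj_ctx = nn.Linear(ctx_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, context):
        # regions: (batch, k, feat_dim) -- k region vectors per image
        # context: (batch, ctx_dim)     -- top-down signal (e.g., LSTM state)
        joint = torch.tanh(self.proj_feat(regions)
                           + self.proj_ctx(context).unsqueeze(1))
        alpha = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, k)
        attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)    # weighted sum
        return attended, alpha

# Usage: 36 regions per image, batch of 4.
att = TopDownAttention()
v = torch.randn(4, 36, 2048)   # bottom-up region features
h = torch.randn(4, 1024)       # top-down context
v_hat, weights = att(v, h)     # shapes: (4, 2048) and (4, 36)
```

The design point is the division of labor: the detector fixes which regions exist (bottom-up), while the task context only decides how much each region matters (top-down).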
3,046 Citations
Cascading Top-Down Attention for Visual Question Answering
- Computer Science · 2020 International Joint Conference on Neural Networks (IJCNN)
- 2020
A Cascading Top-Down Attention (CTDA) model is proposed, which highlights the most important information collected from images and questions through a cascading attention process and obtains better results than standard TDA and other state-of-the-art models.
Bottom-up and Top-down Object Inference Networks for Image Captioning
- Computer Science · ACM Transactions on Multimedia Computing, Communications, and Applications
- 2023
This work presents Bottom-up and Top-down Object inference Networks (BTO-Net), which exploit the object sequence of interest as top-down signals to guide image captioning and obtain competitive performance on the COCO benchmark.
In Defense of Grid Features for Visual Question Answering
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This paper revisits grid features for VQA, and finds they can work surprisingly well -- running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion).
Multi-stage Attention based Visual Question Answering
- Computer Science · 2020 25th International Conference on Pattern Recognition (ICPR)
- 2021
This work proposes an alternating bi-directional attention framework that helps both modalities and leads to better representations for the VQA task, benchmarked on the TDIUC dataset against state-of-the-art approaches.
Gated Hierarchical Attention for Image Captioning
- Computer Science · ACCV
- 2018
This paper proposes a bottom-up gated hierarchical attention (GHA) mechanism for image captioning in which low-level concepts are merged into high-level concepts while low-level attended features are simultaneously passed to the top to make predictions.
Question Type Guided Attention in Visual Question Answering
- Computer Science · ECCV
- 2018
This work proposes Question Type-guided Attention (QTA), which utilizes the information of question type to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks.
Re-Attention for Visual Question Answering
- Computer Science · IEEE Transactions on Image Processing
- 2021
A re-attention framework that utilizes the information in answers for the VQA task: it first learns initial attention weights for the objects by computing the similarity of each word-object pair in the feature space, then introduces a gate mechanism that automatically controls the contribution of re-attention to model training based on the entropy of the learned initial visual attention maps (a toy sketch of such an entropy gate follows this entry).
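As a side note on the entry above, here is a tiny, hypothetical sketch of an entropy-based gate, under the assumption that the gate simply rescales a contribution by how peaked the attention map is; the cited paper's actual formulation, and the direction of the gating, may differ.

```python
import torch

def entropy_gate(alpha, eps=1e-8):
    # alpha: (batch, k) attention weights over k objects; each row sums to 1.
    ent = -(alpha * (alpha + eps).log()).sum(dim=1)           # Shannon entropy
    max_ent = torch.log(torch.tensor(float(alpha.size(1))))   # entropy of uniform map
    return 1.0 - ent / max_ent  # ~1 for peaked maps, ~0 for near-uniform maps

# Usage: gate a (hypothetical) re-attention loss term per example.
alpha = torch.softmax(torch.randn(4, 36), dim=1)
g = entropy_gate(alpha)        # (4,) values in [0, 1]
```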
Co-Attention Network With Question Type for Visual Question Answering
- Computer Science · IEEE Access
- 2019
A new network architecture combining the proposed co-attention mechanism and question type provides a unified model for VQA and demonstrates the effectiveness of the model as compared with several state-of-the-art approaches.
A Bottom-Up and Top-Down Approach for Image Captioning using Transformer
- Computer Science · ICVGIP
- 2018
Two novel approaches are proposed independently, a top-down approach and a bottom-up approach, each of which dispenses with recurrence entirely by incorporating a Transformer, a network architecture for generating sequences that relies entirely on the mechanism of attention.
Residual Self-Attention for Visual Question Answering
- Computer Science · 2019 1st International Conference on Electrical, Control and Instrumentation Engineering (ICECIE)
- 2019
A multi-stage attention model is put forward in which bottom-up attention, residual self-attention over the image itself, and a question-guided double-headed soft (top-down) attention method are used to extract the image features.
67 References
Hierarchical Question-Image Co-Attention for Visual Question Answering
- Computer Science · NIPS
- 2016
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
Areas of Attention for Image Captioning
- Computer Science · 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
The attention mechanism and spatial transformer attention areas together yield state-of-the-art results on the MSCOCO dataset.
Visual7W: Grounded Question Answering in Images
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A semantic link between textual descriptions and image regions via object-level grounding enables a new type of QA with visual answers, in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A method of incorporating high-level concepts into the successful CNN-RNN approach is proposed, and it is shown that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.
Aligning where to see and what to tell: image caption with region-based attention and scene factorization
- Computer Science · arXiv
- 2015
This paper proposes an image caption system that exploits the parallel structures between images and sentences and makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image.
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
- Computer Science · ECCV
- 2016
The Spatial Memory Network, a novel spatial attention architecture that aligns words with image patches in the first hop, is proposed and improved results are obtained compared to a strong deep baseline model which concatenates image and question features to predict the answer.
Revisiting Visual Question Answering Baselines
- Computer Science · ECCV
- 2016
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
Image Captioning with Semantic Attention
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering
- Computer Science · arXiv
- 2017
This model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both the unbalanced and balanced VQA benchmarks.