Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
@article{Lu2016KnowingWT,
  title   = {Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning},
  author  = {Jiasen Lu and Caiming Xiong and Devi Parikh and Richard Socher},
  journal = {2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2016},
  pages   = {3242-3250}
}
Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model, e.g., "sign" after "behind a red stop", or "phone" following "talking on a cell". In this paper, we propose a…
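The core mechanism the abstract describes is a gate that interpolates between a spatial visual context and a "visual sentinel" computed from the decoder's memory. A minimal PyTorch-style sketch, assuming the spatial features are already projected to the decoder's hidden size (layer names and dimensions are illustrative, not the authors' exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Sketch of adaptive attention with a visual sentinel.

    The decoder mixes a spatial visual context c_t with a sentinel vector s_t
    (in the paper, s_t = sigmoid(W_x x_t + W_h h_{t-1}) * tanh(m_t), computed
    alongside the LSTM step), so it can fall back on the language model for
    non-visual words. Assumes the k spatial features are already projected to
    the decoder's hidden size; sizes here are illustrative.
    """
    def __init__(self, hidden_dim, att_dim):
        super().__init__()
        self.feat_att = nn.Linear(hidden_dim, att_dim)      # spatial features V
        self.hidden_att = nn.Linear(hidden_dim, att_dim)    # decoder state h_t
        self.sentinel_att = nn.Linear(hidden_dim, att_dim)  # sentinel s_t
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, h_t, s_t):
        # feats: (B, k, hidden_dim); h_t, s_t: (B, hidden_dim)
        att_h = self.hidden_att(h_t).unsqueeze(1)                                   # (B, 1, att_dim)
        z = self.score(torch.tanh(self.feat_att(feats) + att_h))                    # (B, k, 1)
        z_s = self.score(torch.tanh(self.sentinel_att(s_t).unsqueeze(1) + att_h))   # (B, 1, 1)
        # Attention over the k regions plus one extra "slot" for the sentinel
        alpha = F.softmax(torch.cat([z, z_s], dim=1), dim=1)                        # (B, k+1, 1)
        c_t = (alpha[:, :-1] * feats).sum(dim=1)        # visual context (B, hidden_dim)
        beta = alpha[:, -1]                             # sentinel gate in [0, 1], (B, 1)
        c_hat = beta * s_t + (1.0 - beta) * c_t         # adaptive context
        return c_hat, alpha.squeeze(-1), beta
```

A gate value beta near 1 means the next word is predicted almost entirely from the language context (e.g. "of"), while a value near 0 means the decoder is attending to the image.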
1,125 Citations
Hierarchical LSTMs with Adaptive Attention for Visual Captioning
- Computer Science · IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2020
A hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning that uses spatial or temporal attention to select specific regions or frames when predicting the related words, while adaptive attention decides whether to depend on the visual information or the language context.
Looking Back and Forward: Enhancing Image Captioning with Global Semantic Guidance
- Computer Science · 2021 International Joint Conference on Neural Networks (IJCNN)
- 2021
A novel Temporal-Free Semantic-Guided attention mechanism (TFSG) that uses a raw caption pre-generated by a primary decoder as extra input to provide global semantic guidance during generation, deepening visual understanding by balancing semantic and visual information.
VSAM-Based Visual Keyword Generation for Image Caption
- Computer Science · IEEE Access
- 2021
An image dataset derived from MSCOCO, the Image Visual Keyword Dataset (IVKD), is presented as the first collection of visual keywords, and a Visual Semantic Attention Model (VSAM) is proposed to obtain visual keywords for generating the annotation.
Attend to Knowledge: Memory-Enhanced Attention Network for Image Captioning
- Computer Science · BICS
- 2018
This paper proposes a memory-enhanced attention model for image captioning that aims to improve the attention mechanism with previously learned knowledge: it stores the visual and semantic knowledge exploited in the past into memories and generates a global visual or semantic feature to improve the model.
Learning visual relationship and context-aware attention for image captioning
- Computer Science · Pattern Recognition
- 2020
RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This paper proposes a Grid-Augmented (GA) module, in which relative geometry features between grids are incorporated to enhance visual representations, and an Adaptive-Attention (AA) module on top of a transformer decoder to adaptively measure the contribution of visual and language cues before making word-prediction decisions.
Deliberate Attention Networks for Image Captioning
- Computer Science · AAAI
- 2019
This paper presents a novel Deliberate Residual Attention Network, namely DA, for image captioning, which is equipped with a discriminative loss and reinforcement learning to disambiguate image/caption pairs and reduce exposure bias.
Adaptively Attending to Visual Attributes and Linguistic Knowledge for Captioning
- Computer Science · ACM Multimedia
- 2017
This work designs a key control unit, termed the visual gate, to adaptively decide "when" and "what" the language generator attends to during word generation, and employs a bottom-up workflow to learn a pool of semantic attributes that serve as the propositional attention resources.
Learning to Caption Images with Two-Stream Attention and Sentence Auto-Encoder
- Computer Science · arXiv
- 2019
A two-stream attention mechanism that automatically discovers latent categories and relates them to image regions based on the previously generated words is proposed, along with a regularization technique that encapsulates the syntactic and semantic structure of captions and improves the optimization of the image captioning model.
Boost image captioning with knowledge reasoning
- Computer Science · Machine Learning
- 2020
This paper proposes word attention to improve the correctness of visual attention when generating sequential descriptions word-by-word, and introduces a new strategy to inject external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning.
References
SHOWING 1-10 OF 38 REFERENCES
Hierarchical Question-Image Co-Attention for Visual Question Answering
- Computer Science · NIPS
- 2016
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).
Mind's eye: A recurrent visual representation for image caption generation
- Computer Science · 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
This paper explores the bi-directional mapping between images and their sentence-based descriptions with a recurrent neural network that attempts to dynamically build a visual representation of the scene as a caption is being generated or read.
From captions to visual concepts and back
- Computer Science · 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
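The multiple-instance learning step described here can be summarized as noisy-OR pooling over image regions: the image is positive for a word if any region fires. A minimal sketch under that assumption (function name and shapes are illustrative, not the authors' exact code):

```python
import torch

def noisy_or_word_probability(region_probs):
    """Noisy-OR pooling over per-region word probabilities.

    region_probs: tensor of shape (num_regions,) holding p_j, the probability
    that the word is depicted in region j. The image-level probability is
    1 - prod_j (1 - p_j): high whenever any single region strongly supports
    the word.
    """
    return 1.0 - torch.prod(1.0 - region_probs)

# Example: three regions, one of which strongly supports the word "dog".
p = noisy_or_word_probability(torch.tensor([0.05, 0.9, 0.1]))  # ≈ 0.915
```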
Show and tell: A neural image caption generator
- Computer Science · 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Computer Science · ICML
- 2015
An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.
Image Captioning with Semantic Attention
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A method of incorporating high-level concepts into the successful CNN-RNN approach is proposed, and it is shown that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.
Boosting Image Captioning with Attributes
- Computer Science · 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
This paper presents Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner.
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
- Computer Science · 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
This work proposes a technique for producing 'visual explanations' for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable, and shows that even non-attention-based models learn to localize discriminative regions of the input image.
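The localization step Grad-CAM performs is simple enough to sketch: gradients of the score with respect to convolutional feature maps are global-average-pooled into channel weights, which then weight the maps before a ReLU. A minimal PyTorch-style sketch (assumes `features` was kept in the autograd graph that produced `score`; names are illustrative):

```python
import torch
import torch.nn.functional as F

def grad_cam(features, score):
    """Sketch of Grad-CAM localization.

    features: conv feature maps of shape (C, H, W) that require gradients and
              participated in computing `score`.
    score:    scalar model output (e.g. the logit of the predicted word/class).
    Returns an (H, W) heatmap of regions supporting the score.
    """
    # Gradient of the score with respect to each feature map
    grads = torch.autograd.grad(score, features, retain_graph=True)[0]  # (C, H, W)
    # Global-average-pool the gradients to get per-channel importance weights
    weights = grads.mean(dim=(1, 2))                                    # (C,)
    # Weighted combination of the feature maps, followed by ReLU
    cam = F.relu((weights[:, None, None] * features).sum(dim=0))        # (H, W)
    # Normalize to [0, 1] for visualization
    return cam / (cam.max() + 1e-8)
```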
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
- Computer Science · ICLR
- 2015
The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvements over state-of-the-art methods that directly optimize the ranking objective function for retrieval.