Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

  • Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher
  • Published 6 December 2016
  • Computer Science
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as "the" and "of". Other words that may seem visual can often be predicted reliably just from the language model, e.g., "sign" after "behind a red stop" or "phone" following "talking on a cell". In this paper, we propose a…
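
The abstract above is cut off before it describes the method, so the following is only a rough sketch of the idea the title points to: letting the decoder choose, per word, between image regions and a language-side "sentinel" vector. It assumes the choice is made by appending one extra score for the sentinel to the usual softmax over region scores. The function names, argument names, and toy numbers are hypothetical and are not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, region_scores, sentinel, sentinel_score):
    """Mix attended image regions with a sentinel vector (illustrative only).

    region_feats:   list of k feature vectors, one per image region
    region_scores:  list of k unnormalized attention scores
    sentinel:       vector standing in for what the language model already knows
    sentinel_score: unnormalized score for falling back on the sentinel

    Returns (context, beta), where beta in [0, 1] is the weight placed on the
    sentinel; beta near 1 means "do not look at the image for this word".
    """
    # Assumption: the sentinel competes with the regions in a single softmax.
    weights = softmax(list(region_scores) + [sentinel_score])
    beta = weights[-1]
    candidates = list(region_feats) + [sentinel]
    dim = len(sentinel)
    context = [
        sum(w * vec[d] for w, vec in zip(weights, candidates))
        for d in range(dim)
    ]
    return context, beta


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]
    sentinel = [2.0, 3.0]
    # A low sentinel score mimics a visual word; a high one mimics "the" or "of".
    for score in (-5.0, 0.0, 5.0):
        context, beta = adaptive_context(regions, [0.0, 0.0], sentinel, score)
        print(f"sentinel_score={score:+.1f}  beta={beta:.3f}  context={context}")
```

Because the sentinel shares one softmax with the regions, the result equals beta times the sentinel plus (1 - beta) times the ordinary attention context, so beta can be read directly as how little the model "looked" at the image for that word.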

Hierarchical LSTMs with Adaptive Attention for Visual Captioning

A hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning that uses spatial or temporal attention to select the regions or frames relevant to each predicted word, while adaptive attention decides whether to rely on visual information or on language context.

Looking Back and Forward: Enhancing Image Captioning with Global Semantic Guidance

A novel Temporal-Free Semantic-Guided attention mechanism (TFSG) that uses the raw caption pre-generated by a primary decoder as an extra input to provide global semantic guidance during generation, deepening visual understanding by balancing semantic and visual information.

VSAM-Based Visual Keyword Generation for Image Caption

An image dataset derived from MSCOCO, the Image Visual Keyword Dataset (IVKD), is presented as the first collection of visual keywords, and a Visual Semantic Attention Model (VSAM) is proposed to obtain visual keywords for generating the annotation.

Attend to Knowledge: Memory-Enhanced Attention Network for Image Captioning

This paper proposes a memory-enhanced attention model for image captioning that aims to improve the attention mechanism with previously learned knowledge: it stores visual and semantic knowledge exploited in the past in memories and generates a global visual or semantic feature to improve the model.

RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

This paper proposes a Grid-Augmented (GA) module, in which relative geometry features between grids are incorporated to enhance visual representations, and an Adaptive-Attention (AA) module on top of a transformer decoder that adaptively measures the contribution of visual and language cues before making decisions for word prediction.

Deliberate Attention Networks for Image Captioning

This paper presents a novel Deliberate Residual Attention Network, namely DA, for image captioning, which is equipped with discriminative loss and reinforcement learning to disambiguate image/caption pairs and reduce exposure bias.

Adaptively Attending to Visual Attributes and Linguistic Knowledge for Captioning

This work designs a key control unit, termed visual gate, to adaptively decide "when" and "what" the language generator attends to during the word generation process, and employs a bottom-up workflow to learn a pool of semantic attributes for serving as the propositional attention resources.

Learning to Caption Images with Two-Stream Attention and Sentence Auto-Encoder

This work proposes a two-stream attention mechanism that automatically discovers latent categories and relates them to image regions based on the previously generated words, together with a regularization technique that encapsulates the syntactic and semantic structure of captions and improves the optimization of the image captioning model.

Boost image captioning with knowledge reasoning

This paper proposes word attention to improve the correctness of visual attention when generating sequential descriptions word by word, and introduces a new strategy to inject external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning.



Hierarchical Question-Image Co-Attention for Visual Question Answering

This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).

Mind's eye: A recurrent visual representation for image caption generation

This paper explores the bi-directional mapping between images and their sentence-based descriptions with a recurrent neural network that attempts to dynamically build a visual representation of the scene as a caption is being generated or read.

From captions to visual concepts and back

This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

An attention-based model that automatically learns to describe the content of images is introduced; it can be trained either deterministically, using standard backpropagation techniques, or stochastically, by maximizing a variational lower bound.

Image Captioning with Semantic Attention

This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.

What Value Do Explicit High Level Concepts Have in Vision to Language Problems?

A method of incorporating high-level concepts into the successful CNN-RNN approach is proposed, and it is shown that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.

Boosting Image Captioning with Attributes

This paper presents Long Short-Term Memory with Attributes (LSTM-A), a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework by training them in an end-to-end manner.

Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization

This work proposes a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable, and shows that even non-attention-based models learn to localize discriminative regions of the input image.

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.