Areas of Attention for Image Captioning

@article{Pedersoli2017AreasOA,
  title={Areas of Attention for Image Captioning},
  author={Marco Pedersoli and Thomas Lucas and Cordelia Schmid and Jakob Verbeek},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1251-1259}
}
We propose “Areas of Attention”, a novel attention-based model for automatic image captioning. Our approach models the dependencies between image regions, caption words, and the state of an RNN language model, using three pairwise interactions. In contrast to previous attention-based approaches that associate image regions only to the RNN state, our method allows a direct association between caption words and image regions. During training these associations are inferred from image-level captions…
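The three pairwise interactions suggest a compact score over (word, region) pairs conditioned on the RNN state. Below is a minimal NumPy sketch of that idea; the interaction matrices `theta`, `phi`, `psi` and all dimensions are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def attention_scores(W, R, h, theta, phi, psi):
    """Score every (word, region) pair with three pairwise terms:
    word-state, region-state, and word-region interactions.
    W: (V, dw) word embeddings; R: (K, dr) region features; h: (dh,) RNN state.
    theta: (dw, dh), phi: (dr, dh), psi: (dw, dr) are bilinear interaction
    matrices (names and shapes are assumptions for illustration).
    Returns a (V, K) matrix of unnormalized scores."""
    word_state   = (W @ theta @ h)[:, None]   # (V, 1): word-state term
    region_state = (R @ phi @ h)[None, :]     # (1, K): region-state term
    word_region  = W @ psi @ R.T              # (V, K): word-region term
    return word_state + region_state + word_region

rng = np.random.default_rng(0)
V, K, dw, dr, dh = 100, 14, 32, 64, 48
s = attention_scores(rng.normal(size=(V, dw)), rng.normal(size=(K, dr)),
                     rng.normal(size=dh), rng.normal(size=(dw, dh)),
                     rng.normal(size=(dr, dh)), rng.normal(size=(dw, dr)))
# A joint softmax over all (word, region) pairs couples each caption word
# directly to an image region, the association the abstract describes.
p = np.exp(s - s.max())
p /= p.sum()
```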

Citations

Gated Hierarchical Attention for Image Captioning
TLDR
This paper proposes a bottom-up gated hierarchical attention (GHA) mechanism for image captioning in which low-level concepts are merged into high-level concepts and, simultaneously, low-level attended features pass to the top to make predictions.
Neural Attention for Image Captioning: Review of Outstanding Methods
TLDR
This survey provides a review of literature related to attentive deep learning models for image captioning, and aims at finding the most successful types of attention mechanisms in deep models for image captioning.
Multi-decoder Based Co-attention for Image Captioning
TLDR
A novel multi-decoder based co-attention framework for image captioning, composed of multiple decoders that integrate a detection-based attention mechanism with a free-form region-based attention mechanism, achieves state-of-the-art performance.
GateCap: Gated spatial and semantic attention model for image captioning
TLDR
A gated spatial and semantic attention captioning model (GateCap) that adaptively fuses spatial attention features with semantic attention features, reducing the side effect of word-to-region misalignment at one time step on subsequent word prediction and thereby alleviating the emergence of incorrect words during testing.
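One plausible reading of "adaptively fuses" is a learned sigmoid gate that mixes the two feature streams at each time step. The sketch below assumes that form; the weight `Wg`, the concatenation order, and all shapes are hypothetical, not GateCap's published architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h, v_spatial, v_semantic, Wg, bg):
    """Mix spatial and semantic attention features with a learned gate
    conditioned on the decoder state and both streams. Concatenation
    scheme and parameters are illustrative assumptions."""
    g = sigmoid(Wg @ np.concatenate([h, v_spatial, v_semantic]) + bg)  # (d,)
    return g * v_spatial + (1.0 - g) * v_semantic                      # (d,)

d, dh = 8, 6
rng = np.random.default_rng(1)
v = gated_fusion(rng.normal(size=dh), rng.normal(size=d), rng.normal(size=d),
                 rng.normal(size=(d, dh + 2 * d)), np.zeros(d))
```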
Multiple-Level Feature-Based Network for Image Captioning
TLDR
A multiple-level feature-based network for image captioning that leads to more accurate subject prediction and more vivid sentence descriptions, outperforming state-of-the-art methods on the MS-COCO dataset.
Graph Self-Attention Network for Image Captioning
  • Qitong Zheng, Yuping Wang
  • Computer Science
    2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA)
  • 2020
TLDR
A novel attention model, named graph self-attention (GSA), that incorporates graph networks and self-attention for image captioning and can be applied to tasks that require attention to multiple features is proposed.
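A common way to combine a region graph with self-attention is to mask the attention logits with the adjacency matrix, so each node attends only to its graph neighbors. The sketch below illustrates that generic pattern; GSA's actual formulation may differ.

```python
import numpy as np

def graph_self_attention(X, A, Wq, Wk, Wv):
    """Scaled dot-product self-attention restricted to graph edges.
    X: (N, d) node (region) features; A: (N, N) adjacency with 1 on edges
    (include self-loops so every node can attend to itself). Masking
    non-edges is one plausible way to marry graph structure with
    self-attention, not necessarily GSA's exact mechanism."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[1])
    logits = np.where(A > 0, logits, -1e9)   # attend only along edges
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V                             # updated node features
```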
Boost Image Captioning with Knowledge Reasoning
TLDR
This paper proposes word attention to improve the correctness of visual attention when generating sequential descriptions word by word, and introduces a new strategy to inject external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning.
Object-aware semantics of attention for image captioning
TLDR
The object-aware semantic attention (OSA) based captioning model allows explicit associations between objects by coupling the attention mechanism with three types of semantic concepts: the category information, relative sizes of the objects, and relative distances between objects.
Hybrid Attention Distribution and Factorized Embedding Matrix in Image Captioning
TLDR
A hybrid attention distribution is proposed that reconstructs multiple distributions to express deeper internal relations, avoiding a single shallow attention distribution, and outperforms existing state-of-the-art methods on some metrics.
Image Captioning Using Region-Based Attention Joint with Time-Varying Attention
TLDR
A novel region-based and time-varying attention network (RTAN) model for image captioning, which determines where and when to attend to the image, and attends only to semantic information when predicting non-visual words.
...

References

Showing 1-10 of 52 references
Attention Correctness in Neural Image Captioning
TLDR
It is shown on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training consistently improves both attention correctness and caption quality, showing the promise of making machine perception more human-like.
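Supervising attention maps can be phrased as an extra loss term that pulls the model's attention distribution toward a ground-truth alignment. A minimal sketch of that generic form, assuming both maps are normalized distributions over regions (the paper's exact loss may differ):

```python
import numpy as np

def attention_supervision_loss(alpha, alpha_gt, eps=1e-9):
    """Cross-entropy pulling the model's attention map over regions
    (alpha) toward a ground-truth alignment distribution (alpha_gt);
    both sum to 1. Added to the captioning loss during training."""
    return -np.sum(alpha_gt * np.log(alpha + eps))
```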
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
TLDR
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
Aligning where to see and what to tell: image caption with region-based attention and scene factorization
TLDR
This paper proposes an image caption system that exploits the parallel structures between images and sentences and makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image.
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
TLDR
The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
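The distinctive piece of the m-RNN is a multimodal layer that fuses the word embedding, the recurrent state, and the image feature before predicting the next word. A sketch of one step; additive fusion and the weight names are illustrative assumptions:

```python
import numpy as np

def multimodal_step(w_emb, h, v, Ww, Wh, Wv, Wo, b):
    """One m-RNN-style step: fuse the word embedding (w_emb), recurrent
    state (h), and image feature (v) in a multimodal layer, then score
    the vocabulary. Shapes: Ww (dm, de), Wh (dm, dh), Wv (dm, dv),
    b (dm,), Wo (V, dm); all assumed for illustration."""
    m = np.tanh(Ww @ w_emb + Wh @ h + Wv @ v + b)   # multimodal layer
    logits = Wo @ m                                 # vocabulary scores
    p = np.exp(logits - logits.max())
    return p / p.sum()                              # p(next word | words, image)
```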
Image Captioning with Semantic Attention
TLDR
This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
TLDR
A method of incorporating high-level concepts into the successful CNN-RNN approach is proposed, and it is shown that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.
Grounding of Textual Phrases in Images by Reconstruction
TLDR
A novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly, and demonstrates its effectiveness on the Flickr30k Entities and ReferItGame datasets.
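Grounding by reconstruction can be sketched as: attend over candidate boxes using the phrase encoding, then score how well a decoder reconstructs the phrase from the attended feature, so the attention weights become the grounding. The matrix `Wa` and the `decode_logprob` interface below are hypothetical stand-ins:

```python
import numpy as np

def ground_by_reconstruction(phrase_vec, regions, Wa, decode_logprob):
    """Latent grounding: attend over candidate boxes with the phrase
    encoding, then ask a decoder to reconstruct the phrase from the
    attended visual feature. Training maximizes reconstruction
    likelihood. phrase_vec: (dp,); regions: (K, dr); Wa: (dr, dp);
    decode_logprob(phrase, feature) is a hypothetical decoder hook."""
    logits = regions @ Wa @ phrase_vec          # (K,) box scores
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                        # attention over boxes
    v = alpha @ regions                         # attended visual feature
    return -decode_logprob(phrase_vec, v), alpha   # loss, grounding
```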
From captions to visual concepts and back
TLDR
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
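For the multiple-instance-learning step, a standard pooling choice is noisy-OR: the image fires a word's detector if any region does. A one-line sketch, with per-region probabilities assumed given:

```python
import numpy as np

def noisy_or_word_probability(p_region):
    """Multiple-instance pooling: p(word in image) under the noisy-OR
    model, a common MIL pooling for caption-word detectors.
    p_region: (K,) per-region probabilities for one word."""
    return 1.0 - np.prod(1.0 - p_region)
```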
Deep Visual-Semantic Alignments for Generating Image Descriptions
  • A. Karpathy, Li Fei-Fei
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
TLDR
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
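The structured objective rests on an image-sentence score in which each word fragment picks its best-matching region. A minimal version of that max-over-regions score, with embeddings assumed precomputed in a shared multimodal space:

```python
import numpy as np

def image_sentence_score(regions, words):
    """Alignment score between an image and a sentence: every word
    fragment is matched to its best region and the similarities are
    summed. regions: (K, d); words: (T, d), both in a shared space."""
    sims = words @ regions.T          # (T, K) fragment similarities
    return sims.max(axis=1).sum()     # each word picks its best region
```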
Show and tell: A neural image caption generator
TLDR
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
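The recipe is the now-standard encoder-decoder: a CNN image feature seeds a recurrent decoder that emits words until an end token. The sketch below uses a plain tanh RNN in place of the LSTM and assumes placeholder weights; it shows the greedy decoding loop, not a trained model.

```python
import numpy as np

def greedy_caption(v, E, Wx, Wh, Wo, bos, eos, max_len=20):
    """Greedy decoding: the CNN feature v initializes the state, and the
    RNN emits one word per step until the end token. A tanh RNN stands
    in for the LSTM. E: (V, de) word embeddings; Wx: (dh, de);
    Wh: (dh, dh); Wo: (V, dh); bos/eos: integer word ids."""
    h = np.tanh(v)                       # image feature as initial state
    word, caption = bos, []
    for _ in range(max_len):
        h = np.tanh(Wx @ E[word] + Wh @ h)
        word = int(np.argmax(Wo @ h))    # greedy: most likely next word
        if word == eos:
            break
        caption.append(word)
    return caption
```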
...