Rethinking the Form of Latent States in Image Captioning

Bo Dai, Deming Ye, Dahua Lin. European Conference on Computer Vision.

RNNs and their variants have been widely adopted for image captioning. In RNNs, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. This work rethinks that choice and studies an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by curiosity about a question: how do the spatial structures in the latent states affect the resultant…

A Neural Compositional Paradigm for Image Captioning

This paper presents an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: extracting an explicit semantic representation from the given image and constructing the caption based on a recursive compositional procedure in a bottom-up manner.

Macroscopic Control of Text Generation for Image Captioning

A control signal is introduced that can control macroscopic sentence attributes such as sentence quality, length, tense, and number of nouns. In addition, a strategy is proposed in which an image-text matching model is trained to measure the quality of sentences generated in both the forward and backward directions, and the better one is chosen.

Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style

A Mutual-aid network structure with Bidirectional LSTMs (MaBi-LSTMs) is proposed for acquiring overall contextual information, together with a cross-modal attention mechanism that retouches the two generated sentences by fusing their salient parts as well as the salient areas of the image.

From Show to Tell: A Survey on Deep Learning-Based Image Captioning

This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics, and quantitatively compares many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.

Structure Preserving Convolutional Attention for Image Captioning

A convolutional attention module that preserves the spatial structure of the image by performing the convolution operation directly on the 2D feature maps, aiming to determine the intended regions for describing the image along both the spatial and channel dimensions.
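
The idea of computing attention on the 2D map itself, rather than on a flattened list of region vectors, can be sketched as follows. This is a toy NumPy version under assumed shapes; the function name, the single-filter scoring, and the kernel size are illustrative stand-ins for the learned convolutional attention module, not the paper's exact design.

```python
import numpy as np

def conv_spatial_attention(fmap, kernel):
    """Spatial attention computed directly on the 2D feature map, so
    neighbouring locations influence each other's attention scores.

    fmap: (H, W, C) feature map; kernel: (k, k, C) conv filter that
    produces one attention score per spatial location (toy stand-in).
    """
    H, W, C = fmap.shape
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(fmap, ((p, p), (p, p), (0, 0)))  # same-size output
    scores = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            # Convolution at (i, j): correlate the kernel with the patch.
            scores[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    a = np.exp(scores - scores.max())
    a /= a.sum()                                      # softmax over H*W locations
    context = (fmap * a[..., None]).sum(axis=(0, 1))  # (C,) attended feature
    return context, a
```

Because the scores come from a convolution over the map, the attention weights at a location depend on its spatial neighbourhood, which is the structure-preserving property the entry describes.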

Panoptic Segmentation-Based Attention for Image Captioning

This work proposes panoptic segmentation-based attention that performs attention at a mask-level (i.e., the shape of the main part of an instance) and extracts feature vectors from the corresponding segmentation regions, which is more fine-grained than current attention mechanisms.

Sequential image encoding for vision-to-language problems

Experimental results on image captioning and VQA benchmarks support the hypothesis that appropriately arranging the object sequence is beneficial for Vision-to-Language (V2L) problems.

Image Captioning based on Deep Learning Methods: A Survey

A survey of advances in image captioning based on deep learning methods is presented, covering the encoder-decoder structure, improved methods for the encoder, improved methods for the decoder, and other improvements.

Cross-Modal Representation

This chapter first introduces typical cross-modal representation models, and then reviews several real-world applications related to cross-modal representation learning, including image captioning, visual relation detection, and visual question answering.

Contrastive Learning for Image Captioning

This work proposes a new learning method, Contrastive Learning (CL), for image captioning, which, via two constraints formulated on top of a reference model, encourages distinctiveness while maintaining the overall quality of the generated captions.
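
The "two constraints on top of a reference model" can be illustrated with a small sketch: relative to a frozen reference captioner, the target model's log-probability should rise on matched (image, caption) pairs and fall on mismatched ones. The function name, the logistic surrogate, and the per-example log-probability inputs are all assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def contrastive_caption_loss(logp_target, logp_ref, logp_target_mis, logp_ref_mis):
    """Contrastive-learning sketch for captioning.

    logp_target / logp_ref: target and reference model log-probs on
    matched (image, caption) pairs; *_mis: the same on mismatched pairs.
    The target should beat the reference on matches (difference > 0)
    and undercut it on mismatches (difference < 0).
    """
    d_match = logp_target - logp_ref          # want positive
    d_mis = logp_target_mis - logp_ref_mis    # want negative
    # Logistic surrogate penalizing violations of both constraints.
    return float(np.mean(np.log1p(np.exp(-d_match)) + np.log1p(np.exp(d_mis))))
```

The reference model anchors overall caption quality, while the two difference terms push the target toward captions that are distinctive to their own image.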

Boosting Image Captioning with Attributes

This paper presents Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner.

Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

This paper proposes a novel adaptive attention model with a visual sentinel that sets a new state of the art on image captioning by a significant margin.

Towards Diverse and Natural Image Descriptions via a Conditional GAN

A new framework based on Conditional Generative Adversarial Networks (CGAN) is proposed, which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content.

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Attention Correctness in Neural Image Captioning

It is shown on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training solidly improves both attention correctness and caption quality, showing the promise of making machine perception more human-like.

SPICE: Semantic Propositional Image Caption Evaluation

There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap.

Deep Visual-Semantic Alignments for Generating Image Descriptions

A. Karpathy, Li Fei-Fei. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques, or stochastically by maximizing a variational lower bound.
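
The deterministic ("soft") variant of this visual attention can be sketched as an additive-attention weighted sum over region features. Shapes and projection names below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, w_a):
    """Soft visual attention over L image regions (sketch).

    features: (L, D) region features; hidden: (H,) decoder state;
    W_f (D, K), W_h (H, K), w_a (K,) are learned projections.
    Returns the expected context vector and the attention weights.
    """
    # Additive attention score for each region.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a  # (L,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over regions
    context = weights @ features                           # (D,) expected feature
    return context, weights
```

Because the context is a differentiable expectation over region features, the whole model trains with standard backpropagation; the "hard" variant instead samples one region and maximizes a variational lower bound.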

Review Networks for Caption Generation

The review network performs a number of review steps with attention mechanism on the encoder hidden states, and outputs a thought vector after each review step; the thought vectors are used as the input of the attention mechanism in the decoder.
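The review-then-decode flow described above can be sketched in a few lines. This is a toy NumPy version: the review queries are passed in directly, whereas the real model produces them with LSTM states, and all names and shapes are illustrative.

```python
import numpy as np

def review_and_decode_step(enc_states, review_queries, dec_query, W_thought):
    """Review network sketch: T review steps each attend over the encoder
    hidden states and emit a 'thought vector'; the decoder then attends
    over the thought vectors instead of the raw encoder states.

    enc_states: (N, D), review_queries: (T, D), dec_query: (K,),
    W_thought: (D, K) projection producing K-dim thought vectors.
    """
    def attend(keys, query):
        scores = keys @ query
        a = np.exp(scores - scores.max())
        a /= a.sum()
        return a @ keys                               # attention readout

    # One thought vector per review step.
    thoughts = np.stack([np.tanh(attend(enc_states, q) @ W_thought)
                         for q in review_queries])    # (T, K)
    # Decoder attends over the thought vectors, not the encoder states.
    return attend(thoughts, dec_query)                # (K,) decoder context
```

The fixed, small set of thought vectors acts as a compact multi-step summary of the input that the decoder's attention can range over.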