Rethinking the Form of Latent States in Image Captioning
@inproceedings{Dai2018RethinkingTF,
  title     = {Rethinking the Form of Latent States in Image Captioning},
  author    = {Bo Dai and Deming Ye and Dahua Lin},
  booktitle = {European Conference on Computer Vision},
  year      = {2018}
}
RNNs and their variants have been widely adopted for image captioning. In an RNN, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. We rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by a question: how do the spatial structures in the latent states affect the resultant…
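The contrast the abstract draws can be illustrated with a minimal NumPy sketch (not the authors' implementation; all function names and shapes are illustrative assumptions): a conventional step updates a flat vector state with matrix products, while the two-dimensional alternative keeps the state as a (channels, height, width) map and updates it with convolutions, so spatial structure is preserved across time steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conventional RNN step: the latent state is a flat vector h in R^d,
# so any spatial layout of the image features is discarded.
def vector_step(h, x, W_h, W_x):
    # h, x: (d,); W_h, W_x: (d, d)
    return np.tanh(W_h @ h + W_x @ x)

# Naive 'same'-padded 2D convolution with 3x3 kernels.
# x: (C_in, H, W); k: (C_out, C_in, 3, 3) -> (C_out, H, W)
def conv2d_same(x, k):
    C, H, W = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((k.shape[0], H, W))
    for o in range(k.shape[0]):
        for c in range(C):
            for i in range(H):
                for j in range(W):
                    out[o, i, j] += np.sum(pad[c, i:i+3, j:j+3] * k[o, c])
    return out

# Alternative studied in the paper: the latent state is a 2D map,
# updated convolutionally so that each location only mixes with its
# spatial neighborhood, preserving structure over time.
def map_step(Hm, Xm, K_h, K_x):
    # Hm, Xm: (C, H, W) latent/input maps; K_h, K_x: (C, C, 3, 3)
    return np.tanh(conv2d_same(Hm, K_h) + conv2d_same(Xm, K_x))
```

In this sketch the vector update lets every state dimension interact with every other, while the map update constrains interactions to local neighborhoods; the paper's question is how this structural difference affects the generated captions.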
13 Citations
A Neural Compositional Paradigm for Image Captioning
- Computer Science, NeurIPS
- 2018
This paper presents an alternative paradigm for image captioning, which factorizes the captioning procedure into two stages: extracting an explicit semantic representation from the given image and constructing the caption based on a recursive compositional procedure in a bottom-up manner.
Macroscopic Control of Text Generation for Image Captioning
- Computer Science, ArXiv
- 2021
A control signal is introduced that can control macroscopic sentence attributes such as sentence quality, length, tense, and number of nouns; in addition, a strategy is proposed in which an image-text matching model is trained to measure the quality of sentences generated in both the forward and backward directions, and the better one is chosen.
Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
A Mutual-aid network structure with Bidirectional LSTMs (MaBi-LSTMs) is proposed for acquiring overall contextual information, together with a cross-modal attention mechanism that retouches the two sentences by fusing their salient parts as well as the salient areas of the image.
From Show to Tell: A Survey on Deep Learning-Based Image Captioning
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2023
This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics, and quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies.
Structure Preserving Convolutional Attention for Image Captioning
- Computer Science, Applied Sciences
- 2019
A convolutional attention module is proposed that preserves the spatial structure of the image by performing the convolution operation directly on the 2D feature maps, aiming to determine the intended regions for describing the image along both the spatial and channel dimensions.
Panoptic Segmentation-Based Attention for Image Captioning
- Computer Science
- 2020
This work proposes panoptic segmentation-based attention that performs attention at a mask-level (i.e., the shape of the main part of an instance) and extracts feature vectors from the corresponding segmentation regions, which is more fine-grained than current attention mechanisms.
Sequential image encoding for vision-to-language problems
- Computer Science, Multimedia Tools and Applications
- 2019
Experimental results on image captioning and VQA benchmarks support the hypothesis that it is beneficial to appropriately arrange the object sequence in Vision-to-Language (V2L) problems.
Multi-Modal fusion with multi-level attention for Visual Dialog
- Computer Science, Inf. Process. Manag.
- 2020
Image Captioning based on Deep Learning Methods: A Survey
- Computer Science, ArXiv
- 2019
A survey of advances in image captioning based on deep learning methods is presented, covering the encoder-decoder structure, improved methods for the encoder, improved methods for the decoder, and other improvements.
Cross-Modal Representation
- Biology, Computer Science, Representation Learning for Natural Language Processing
- 2020
This chapter first introduces typical cross-modal representation models, and then reviews several real-world applications of cross-modal representation learning, including image captioning, visual relation detection, and visual question answering.
References
Showing 1-10 of 43 references
Contrastive Learning for Image Captioning
- Computer Science, NIPS
- 2017
This work proposes a new learning method, Contrastive Learning (CL), for image captioning, which, via two constraints formulated on top of a reference model, encourages distinctiveness while maintaining the overall quality of the generated captions.
Boosting Image Captioning with Attributes
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
This paper presents Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner.
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
Towards Diverse and Natural Image Descriptions via a Conditional GAN
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
A new framework based on Conditional Generative Adversarial Networks (CGAN) is proposed, which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content.
Show and tell: A neural image caption generator
- Computer Science, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Attention Correctness in Neural Image Captioning
- Computer Science, AAAI
- 2017
It is shown on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training solidly improves both attention correctness and caption quality, showing the promise of making machine perception more human-like.
SPICE: Semantic Propositional Image Caption Evaluation
- Computer Science, ECCV
- 2016
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram…
Deep Visual-Semantic Alignments for Generating Image Descriptions
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2017
A model that generates natural language descriptions of images and their regions is presented, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Computer Science, ICML
- 2015
An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.
Review Networks for Caption Generation
- Computer Science, NIPS
- 2016
The review network performs a number of review steps with attention mechanism on the encoder hidden states, and outputs a thought vector after each review step; the thought vectors are used as the input of the attention mechanism in the decoder.