From Show to Tell: A Survey on Deep Learning-based Image Captioning.

  title={From Show to Tell: A Survey on Deep Learning-based Image Captioning.},
  author={Matteo Stefanini and Marcella Cornia and Lorenzo Baraldi and Silvia Cascianelli and Giuseppe Fiameni and Rita Cucchiara},
  journal={IEEE transactions on pattern analysis and machine intelligence},
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions… 
A Frustratingly Simple Approach for End-to-End Image Captioning
This work proposes a frustratingly simple but highly effective end-to-end image captioning framework, Visual Conditioned GPT (VC-GPT), by connecting the pre-trained visual encoder (CLIP-ViT) and language decoder (GPT2) and coming up with a self-ensemble cross-modal fusion mechanism that comprehensively considers both the single and cross- modal knowledge.
CaMEL: Mean Teacher Learning for Image Captioning
CaMEL, a novel Transformer-based architecture for image captioning that leverages the interaction of two interconnected language models that learn from each other during the training phase, achieves a new state of the art on COCO when training without using external data.
Universal Captioner: Inducing Content-Style Separation in Vision-and-Language Model Training
This paper proposes a model which induces a separation between content and descriptive style through the incorporation of stylistic parameters and keywords extracted from large-scale multimodal models as pivotal data, and consistently outperform existing methods in terms of caption quality and capability of describing out-of-domain concepts.


StructCap: Structured Semantic Embedding for Image Captioning
The proposed StructCap model parses a given image into key entities and their relations organized in a visual parsing tree, which is transformed and embedded under an encoder-decoder framework via visual attention.
Exploring Visual Relationship for Image Captioning
This paper introduces a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework that novelly integrates both semantic and spatial object relationships into image encoder.
Show and tell: A neural image caption generator
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Towards Diverse and Natural Image Descriptions via a Conditional GAN
A new framework based on Conditional Generative Adversarial Networks (CGAN) is proposed, which jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content.
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
A method of incorporating high-level concepts into the successful CNN-RNN approach is proposed, and it is shown that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.
RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words
This paper proposes Grid-Augmented (GA) module, in which relative geometry features between grids are incorporated to enhance visual representations, and proposes Adaptive-Attention (AA) module on top of a transformer decoder to adaptively measure the contribution of visual and language cues before making decisions for word prediction.
Boosting Image Captioning with Attributes
This paper presents Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner.
Reflective Decoding Network for Image Captioning
It is shown that vocabulary coherence between words and syntactic paradigm of sentences are also important to generate high-quality image captioning, and the proposed Reflective Decoding Network (RDN) enhances both the long-sequence dependency and position perception of words in a caption decoder.
Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention
This work proposes an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on whichParts of the image are salient and which are contextual.
Convolutional Image Captioning
This paper develops a convolutional image captioning technique that demonstrates efficacy on the challenging MSCOCO dataset and demonstrates performance on par with the LSTM baseline, while having a faster training time per number of parameters.