Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning

  title={Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning},
  author={Cheng Wang and Haojin Yang and Christoph Meinel},
  journal={ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)},
  pages={1 - 20}
  • Cheng Wang, Haojin Yang, C. Meinel
  • Published 25 April 2018
  • Computer Science
  • ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)
Generating a novel and descriptive caption of an image is drawing increasing interests in computer vision, natural language processing, and multimedia communities. [] Key Method We also explore deep multimodal bidirectional models, in which we increase the depth of nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent overfitting in training deep models.
Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation
Evaluation on widely used metrics have shown that the proposed Inception-ResNet Convolutional Neural Network as encoder, Hierarchical Context based Word Embeddings for word representations and a Deep Stacked Long Short Term Memory network as decoder leads to considerable improvement in model performance.
Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning
This work proposes an attention mechanism called Bi-directional Self-Attention (Bi-SAN) for image captioning that computes attention both in forward and backward directions and achieves high performance comparable to state-of-the-art methods.
An encoder-decoder based framework for hindi image caption generation
An encoder-decoder based architecture is proposed where Convolutional Neural Network (CNN) is employed for encoding visual features of an image and stacked Long Short-Term Memory (sLSTM) in combination with both uni-directional LSTM and bi-irectional L STM for generating the captions in Hindi.
Text Augmentation Using BERT for Image Captioning
This work expands the training dataset using text augmentation methods and uses the state-of-the-art language model called Bidirectional Encoder Representations from Transformers (BERT) to show better results than models trained on a dataset without augmentation.
A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning
This work proposes a new approach to train decoders to regress the word embedding of the next word with respect to the previous ones instead of minimizing the log likelihood, and proposes a novel semantic attention mechanism that guides attention points through the image, taking the meaning of the previously generated word into account.
Delayed Combination of Feature Embedding in Bidirectional LSTM CRF for NER
The delayed combination model is compared with the own implementation of the early combination as well as the previous works to convince us that the delayed combination is more effective than the early one and also highly competitive.
Recall What You See Continually Using GridLSTM in Image Captioning
The experimental results clearly demonstrate that the recall network outperforms the conventional encoder–decoder model by a large margin and that it performs comparably to the state-of-the-art methods.
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
A pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN), consisting of three modules: object and sentence encoders that separately learns the representations of each modality and sentence decoder that enables both multi- modal reasoning and sentence generation via inter-modal interaction.
A Survey on Various Deep Learning Models for Automatic Image Captioning
Various end to end learning-based framework for image captioning using standard evaluation metric is studied to understand how can these frameworks be used for various research applications and futuristic challenges have also been discussed.
Dual-path Convolutional Image-Text Embeddings with Instance Loss
An end-to-end dual-path convolutional network to learn the image and text representations based on an unsupervised assumption that each image/text group can be viewed as a class, which allows the system to directly learn from the data and fully utilize the supervision.


Image Captioning with Deep Bidirectional LSTMs
This work presents an end-to-end trainable deep bidirectional LSTM (Long-Short Term Memory) model for image captioning that builds on a deep convolutional neural network and two separate L STM networks and qualitatively analyzes how the models "translate" image to sentence.
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
This work introduces the structure-content neural language model that disentangles the structure of a sentence to its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
Long-term recurrent convolutional networks for visual recognition and description
A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Show and tell: A neural image caption generator
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Learning a Recurrent Visual Representation for Image Caption Generation
This paper uses a novel recurrent visual memory that automatically learns to remember long-term visual concepts to aid in both sentence generation and visual feature reconstruction and evaluates the approach on several tasks.
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
A deep semantic framework for multimodal representation learning
A novel unified deep neural framework for multimodal representation learning to capture the high-level semantic correlations across modalities and achieves state-of-the-art results compare to both shallow and deep models in multimmodal and cross-modal retrieval.
DeepFont: Identify Your Font from An Image
This work builds up the first available large-scale VFR dataset, named AdobeVFR, consisting of both labeled synthetic data and partially labeled real-world data, and introduces a Convolutional Neural Network decomposition approach and a novel learning-based model compression approach in order to reduce the DeepFont model size without sacrificing its performance.
Image Captioning with Semantic Attention
This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.