Corpus ID: 236428439

Boosting Video Captioning with Dynamic Loss Network

Nasibullah and P. P. Mohanta
Video captioning is a challenging problem at the intersection of vision and language, with many real-life applications in video retrieval, video surveillance, assisting visually challenged people, human-machine interfaces, and more. Recent deep learning based methods [1,2,3] have shown promising results, but performance still lags behind other vision tasks (such as image classification and object detection). A significant drawback of existing video captioning methods is that they…


Reconstruction Network for Video Captioning
Proposes a reconstruction network with a novel encoder-decoder-reconstructor architecture that leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning; the backward flow boosts the encoding models and leads to significant gains in video caption accuracy.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summary of different state-of-the-art video-to-text approaches, shows that a hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
Memory-Attended Recurrent Network for Video Captioning
The Memory-Attended Recurrent Network (MARN) for video captioning is proposed, in which a memory structure is designed to explore the full-spectrum correspondence between a word and its various similar visual contexts across videos in training data.
Object Relational Graph With Teacher-Recommended Learning for Video Captioning
This paper proposes an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation, and designs a teacher-recommended learning method to make full use of a successful external language model (ELM) to integrate abundant linguistic knowledge into the caption model.
Multimodal Video Description
This paper bases its multimodal video description network on the state-of-the-art sequence-to-sequence video-to-text (S2VT) model and extends it to take advantage of multiple modalities.
Show and tell: A neural image caption generator
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Sequence to Sequence -- Video to Text
A novel end-to-end sequence-to-sequence model to generate captions for videos that is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model.
Describing Videos by Exploiting Temporal Structure
This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that allows the model to go beyond local temporal modeling and learn to automatically select the most relevant temporal segments given the text-generating RNN.
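The soft temporal attention idea above can be sketched as follows; the dot-product scoring, feature shapes, and function name are illustrative assumptions, not the paper's exact formulation (which uses a learned relevance score conditioned on the decoder RNN state).

```python
import numpy as np

def temporal_attention(frame_feats, query):
    """Illustrative soft attention over video frames.

    frame_feats: (T, D) array of per-frame features.
    query: (D,) decoder-state vector used to score each frame.
    Returns the attention-weighted context feature of shape (D,).
    """
    scores = frame_feats @ query             # (T,) relevance score per frame
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # weights over time sum to 1
    return weights @ frame_feats             # weighted sum of frame features
```

In practice the weights are recomputed at every decoding step, so different words in the caption can attend to different temporal segments of the video.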
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
Self-Critical Sequence Training for Image Captioning
This paper considers the problem of optimizing image captioning systems using reinforcement learning, and shows that by carefully optimizing systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
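The core of self-critical sequence training is a policy-gradient loss whose baseline is the reward of the model's own greedy decode. A minimal sketch, assuming precomputed rewards (e.g. CIDEr scores) and token log-probabilities; the function name and signature are hypothetical:

```python
import numpy as np

def scst_loss(sample_logprobs, sampled_reward, greedy_reward):
    """Illustrative self-critical policy-gradient loss.

    sample_logprobs: log-probabilities of the sampled caption's tokens.
    sampled_reward: metric score (e.g. CIDEr) of the sampled caption.
    greedy_reward: score of the greedy-decoded caption, used as baseline.
    """
    # Captions better than the greedy baseline get positive advantage,
    # so minimizing the loss increases their log-probability.
    advantage = sampled_reward - greedy_reward
    return -advantage * np.sum(sample_logprobs)
```

Using the greedy decode as the baseline needs no extra value network and directly normalizes each sample against the model's current test-time behavior.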