Multi-Task Video Captioning with Video and Entailment Generation

Ramakanth Pasunuru and Mohit Bansal
Video captioning, the task of describing the content of a video, has seen promising improvements in recent years with sequence-to-sequence models, but accurately learning the temporal and logical dynamics involved in the task remains a challenge, especially given the lack of sufficient annotated data. […] For this, we present a many-to-many multi-task learning model that shares parameters across the encoders and decoders of the three tasks. We achieve significant improvements and the new…
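As a rough illustration of the many-to-many sharing described above (module and task names here are hypothetical stand-ins, not the paper's actual implementation), the key idea is that encoders and decoders are shared by reference across tasks, so gradients from any task update one common set of weights:

```python
class Module:
    """Toy stand-in for a neural module; `params` would hold its weights."""
    def __init__(self, name):
        self.name = name
        self.params = {}  # shared by reference wherever this module is reused

# Illustrative modules (names are assumptions for this sketch):
video_encoder = Module("video_encoder")      # shared: captioning + video prediction
text_decoder = Module("text_decoder")        # shared: captioning + entailment generation
premise_encoder = Module("premise_encoder")  # entailment-only

# Many-to-many wiring: each task is an (encoder, decoder) pair.
tasks = {
    "video_captioning":      (video_encoder, text_decoder),
    "video_prediction":      (video_encoder, Module("frame_decoder")),
    "entailment_generation": (premise_encoder, text_decoder),
}

# The captioning and video-prediction encoders are the same object, so
# training either task would update a single set of encoder weights;
# likewise for the captioning and entailment decoders.
assert tasks["video_captioning"][0] is tasks["video_prediction"][0]
assert tasks["video_captioning"][1] is tasks["entailment_generation"][1]
```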


Video Captioning via Hierarchical Reinforcement Learning

A novel hierarchical reinforcement learning framework for video captioning, where a high-level Manager module learns to design sub-goals and a low-level Worker module recognizes the primitive actions to fulfill the sub-goal.

A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling

A metric to compare the quality of semantic features is proposed, and appropriate features are utilized as input for a semantic detection network (SDN) with adequate complexity in order to generate meaningful semantic features for videos.

A Multitask Learning Encoder-Decoders Framework for Generating Movie and Video Captioning

This paper introduces a novel multitask encoder-decoder framework for generating automatic semantic description and captioning of video sequences and is the first method to use a multi-task approach for encoding video features.

Attentive Visual Semantic Specialized Network for Video Captioning

A new architecture, an encoder-decoder model based on the Adaptive Attention Gate and Specialized LSTM layers, which can selectively decide when to use visual or semantic information in the text generation process, and is capable of learning to improve the expressiveness of generated captions by attending to their length.

Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning

This paper uniquely introduces a Retrieval Augmentation Mechanism (RAM) that enables explicit reference to existing video-sentence pairs within any encoder-decoder captioning model, and presents Retrieval Augmented Convolutional Encoder-Decoder Networks (R-ConvED), which integrate RAM into a convolutional encoding structure to boost video captioning.

CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning

Dual Attribute Prediction is introduced, an auxiliary task requiring a video caption model to learn the correspondence between video content and attributes and the co-occurrence relations between attributes.

MTLE: A Multitask Learning Encoder of Visual Feature Representations for Video and Movie Description

This paper introduces a novel multitask encoder-decoder framework for automatic semantic description and captioning of video sequences that relies on distinct decoders that train a visual encoder in a multitask fashion.

Delving Deeper into the Decoder for Video Captioning

A thorough investigation into the decoder of a video captioning model is made and a combination of variational dropout and layer normalization is embedded into a recurrent unit to alleviate the problem of overfitting.

Variational Stacked Local Attention Networks for Diverse Video Captioning

This work proposes the Variational Stacked Local Attention Network (VSLAN), which exploits low-rank bilinear pooling for self-attentive feature interaction and stacks multiple video feature streams in a discount fashion, facilitating end-to-end diverse and natural captions without any explicit supervision on attributes.

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

The results show that the INP-based model struggles to capture concepts' semantics and is sensitive to irrelevant background information; the proposed Dual Concept Detection (DCD), which injects concept knowledge into the model during training, improves caption quality and highlights the importance of concept-aware representation learning.



MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.

Multi-task Sequence to Sequence Learning

The results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks, and reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context.

Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks

An approach that exploits hierarchical Recurrent Neural Networks to tackle the video captioning problem, i.e., generating one or more sentences to describe a realistic video, and significantly outperforms the current state-of-the-art methods.

Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning

This paper proposes a new approach, the Hierarchical Recurrent Neural Encoder (HRNE), which exploits the temporal structure of videos over a longer range by reducing the length of the input information flow and compositing multiple consecutive inputs at a higher level.

Describing Videos by Exploiting Temporal Structure

This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, with a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
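The temporal attention idea above can be sketched as a softmax-weighted average over per-segment features. This is a minimal sketch: in the paper the relevance scores are produced from the text-generating RNN's state at each decoding step, which is omitted here and replaced by given scores.

```python
import math

def temporal_attention(segment_feats, scores):
    """Soft temporal attention: weight each segment's feature vector by a
    softmax over relevance scores, then sum. `segment_feats` is a list of
    equal-length feature vectors, one per temporal segment."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # softmax over segments
    dim = len(segment_feats[0])
    # Weighted sum per feature dimension:
    return [sum(w * f[d] for w, f in zip(weights, segment_feats))
            for d in range(dim)]

# Uniform scores reduce to mean pooling over segments:
print(temporal_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # [0.5, 0.5]
```

With skewed scores the output moves toward the highest-scoring segment, which is the "select the most relevant temporal segments" behavior the summary describes.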

Unsupervised Learning of Video Representations using LSTMs

This work uses Long Short-Term Memory networks to learn representations of video sequences and evaluates the representations by fine-tuning them for a supervised learning problem, human action recognition, on the UCF-101 and HMDB-51 datasets.

Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild

This paper proposes a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics, and uses state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video.

YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

This paper presents a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action, and its object, and uses a Web-scale language model to "fill in" novel verbs.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

This paper proposes to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure, to create sentence descriptions of open-domain videos with large vocabularies.

Sequence to Sequence Learning with Neural Networks

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
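The source-reversal trick described above is simple to state in code (a toy sketch; real systems apply it during data preprocessing, before tokens are mapped to IDs and fed to the LSTM encoder):

```python
def reverse_source(src_tokens):
    """Reverse a source sentence before encoding, as in Sutskever et al.:
    "a b c" becomes "c b a". The first source words then sit next to the
    first target words at decoding time, creating the short-term
    dependencies that make optimization easier."""
    return list(reversed(src_tokens))

print(reverse_source(["the", "cat", "sat"]))  # ['sat', 'cat', 'the']
```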