MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

@article{Xu2016MSRVTTAL,
  title={MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
  author={Jun Xu and Tao Mei and Ting Yao and Yong Rui},
  journal={2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2016},
  pages={5288-5296}
}
While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. […] We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with soft-attention pooling…
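As a rough illustration of the soft-attention pooling mentioned above, the sketch below weights per-frame features by their relevance to the current decoder state and sums them into a single video representation for the next word prediction. All names, shapes, and the additive scoring form are illustrative assumptions, not the paper's exact model.

import torch
import torch.nn as nn

class SoftAttentionPooling(nn.Module):
    """Sketch of soft-attention pooling over per-frame features (assumed shapes:
    frame_feats is (batch, num_frames, feat_dim), decoder_state is (batch, hidden_dim))."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, decoder_state):
        # Additive attention: one relevance score per frame, conditioned on the decoder state.
        energy = torch.tanh(self.frame_proj(frame_feats) + self.state_proj(decoder_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, num_frames)
        # Weighted sum of frame features -> one video vector per decoding step.
        return (weights.unsqueeze(-1) * frame_feats).sum(dim=1), weights

# Example: 20 frames of 2048-d appearance/motion features, a 512-d decoder state.
pool = SoftAttentionPooling(feat_dim=2048, hidden_dim=512)
context, weights = pool(torch.randn(4, 20, 2048), torch.randn(4, 512))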

Citations

LoL-V2T: Large-Scale Esports Video Description Dataset
TLDR
The dataset, which the authors call LoL-V2T, is the largest video description dataset in the video game domain, comprising 9,723 clips with 62,677 captions; the proposed masking is shown to significantly improve performance.
Video Captioning using Deep Learning: An Overview of Methods, Datasets and Metrics
  • M. Amaresh, S. Chitrakala
  • Computer Science
  • 2019 International Conference on Communication and Signal Processing (ICCSP)
  • 2019
TLDR
This survey discusses various methods using the end-to-end framework of encoder-decoder network based on deep learning approaches to generate the natural language description for video sequences.
Multirate Multimodal Video Captioning
TLDR
The approach utilizes a Multirate GRU to capture the temporal structure of videos and achieves strong performance in the 2nd MSR Video to Language Challenge.
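As a minimal sketch of the multirate idea summarized above, the code below runs two GRUs over the same frame features, one at full frame rate and one at a coarser stride, and fuses their final states; the stride value and the concatenation fusion are assumptions, not the challenge entry's actual configuration.

import torch
import torch.nn as nn

class MultirateVideoEncoder(nn.Module):
    """Illustrative only: two GRUs over the same frame features at different sampling rates."""
    def __init__(self, feat_dim, hidden_dim, slow_stride=4):
        super().__init__()
        self.slow_stride = slow_stride
        self.fast_gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.slow_gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):                      # (batch, num_frames, feat_dim)
        _, h_fast = self.fast_gru(frame_feats)           # every frame
        _, h_slow = self.slow_gru(frame_feats[:, ::self.slow_stride])  # coarsely sampled frames
        # Concatenate the last hidden states of both rates into one video representation.
        return torch.cat([h_fast[-1], h_slow[-1]], dim=-1)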
Video Description
TLDR
The state-of-the-art approaches with a focus on deep learning models are surveyed; benchmark datasets are compared in terms of their domains, number of classes, and repository size; and various evaluation metrics are identified, such as SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD.
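For the n-gram based metrics listed above, a sentence-level BLEU computation with NLTK looks roughly like the sketch below; published benchmarks normally report corpus-level scores via the COCO caption evaluation toolkit, so this is only a minimal illustration with made-up captions.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One generated caption scored against multiple human references (all tokenized).
references = [
    "a man is playing a guitar on stage".split(),
    "a musician performs a guitar solo".split(),
]
candidate = "a man plays a guitar".split()

# BLEU-4 with smoothing, since short captions often have no higher-order n-gram matches.
score = sentence_bleu(
    references, candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")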
CLIP4Caption: CLIP for Video Caption
TLDR
A CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM) and adopts a Transformer-structured decoder network to effectively learn long-range visual and language dependencies.
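The matching stage can be pictured as a CLIP-style contrastive objective over pooled frame embeddings, as in the sketch below over pre-extracted features; the mean pooling and the symmetric InfoNCE loss are assumptions about the general recipe, not CLIP4Caption's exact formulation.

import torch
import torch.nn.functional as F

def video_text_matching_loss(frame_emb, text_emb, temperature=0.07):
    """frame_emb: (batch, num_frames, dim) CLIP image embeddings of sampled frames;
    text_emb: (batch, dim) CLIP text embeddings of the paired captions."""
    video_emb = F.normalize(frame_emb.mean(dim=1), dim=-1)    # pool frames into one video vector
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrastive loss in both retrieval directions (video-to-text and text-to-video).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))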
A Comprehensive Review on Recent Methods and Challenges of Video Description
TLDR
This work reports a comprehensive survey of the phases of video description approaches, datasets for video description, evaluation metrics, open competitions motivating research on video description, open challenges in this field, and future research directions.
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation
TLDR
This model is based on the encoder-decoder pipeline popular in image and video captioning systems, and it is argued that this approach is better suited to the current video captioning task, compared to using a single model, due to the diversity in the dataset.
Boosting Video Description Generation by Explicitly Translating from Frame-Level Captions
TLDR
This paper proposes a novel sequence-to-sequence architecture to generate descriptions for videos, in the sense that the inputs are the captions of sequential frames and the output is generated word by word.
VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
TLDR
This work presents a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese and demonstrates that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation.
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
TLDR
This work proposes a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training, and demonstrates significant improvements, outperforming previously reported methods on both text-to-video and video-to-text retrieval tasks.
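The way missing modalities are handled can be sketched as masked, renormalized mixture weights over per-modality similarity scores; the names, shapes, and the renormalization step below are illustrative assumptions rather than the paper's implementation.

import torch

def mee_similarity(expert_sims, expert_weights, available):
    """expert_sims: (batch, num_experts) similarity of a caption to each modality expert
    (e.g. appearance, motion, audio) of a video; expert_weights: (batch, num_experts)
    text-conditioned mixture logits; available: (batch, num_experts) 1 if the modality exists."""
    # Zero out weights of missing modalities, then renormalize so they sum to 1.
    w = torch.softmax(expert_weights, dim=-1) * available
    w = w / w.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return (w * expert_sims).sum(dim=-1)    # (batch,) final caption-video matching scores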

References

Showing 1-10 of 61 references
Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research
TLDR
An automatic DVS segmentation and alignment method for movies is described that enables the collection of a DVS-derived dataset to be scaled up with minimal human intervention.
C3D: Generic Features for Video Analysis
TLDR
The Convolution 3D (C3D) feature is proposed: a generic spatio-temporal feature obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts; it encapsulates appearance and motion cues and performs well on different video classification tasks.
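A minimal PyTorch sketch of the C3D building block: 3x3x3 convolutions over time, height, and width, with the first pooling layer spatial-only so early temporal information is preserved. Channel sizes and depth here are illustrative, not the full published architecture.

import torch
import torch.nn as nn

c3d_block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),    # pool only spatially in the first stage
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=2),            # later stages also pool over time
)

clip = torch.randn(1, 3, 16, 112, 112)      # (batch, channels, frames, height, width)
features = c3d_block(clip)                  # joint appearance-and-motion feature volume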
Sequence to Sequence -- Video to Text
TLDR
A novel end-to-end sequence-to-sequence model to generate captions for videos that is naturally able to learn the temporal structure of the sequence of frames as well as a sequence model of the generated sentences, i.e. a language model.
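The encode-then-decode structure can be sketched as two LSTMs sharing state, as below; this is a simplified illustration of the idea, not the paper's exact stacked two-layer design or training setup.

import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    """Encoder LSTM reads frame features; decoder LSTM emits word logits (teacher forcing)."""
    def __init__(self, feat_dim, vocab_size, hidden_dim=512, embed_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # Encode the frame sequence, then condition the decoder on its final state.
        _, state = self.encoder(frame_feats)
        dec_out, _ = self.decoder(self.embed(captions), state)
        return self.out(dec_out)    # per-step vocabulary logits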
Describing Videos by Exploiting Temporal Structure
TLDR
This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
TLDR
This paper proposes a hybrid system consisting of a low-level multimodal latent topic model for initial keyword annotation, a middle level of concept detectors, and a high-level module that produces final lingual descriptions capturing the most relevant contents of a video in natural language.
A dataset for Movie Description
TLDR
Comparing ADs to scripts, it is found that ADs are far more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production.
Large-Scale Video Classification with Convolutional Neural Networks
TLDR
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
TLDR
This work combines the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video, and shows that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification.
From captions to visual concepts and back
TLDR
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
Show and tell: A neural image caption generator
TLDR
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.