TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

@article{Lei2020TVRAL,
  title={TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval},
  author={Jie Lei and Licheng Yu and Tamara L. Berg and M. Bansal},
  journal={ArXiv},
  year={2020},
  volume={abs/2001.09099}
}
We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it more realistic. The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal window. The queries are also labeled with query types that indicate whether each of them is more related to video or subtitle or both, allowing for in-depth…