VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

@inproceedings{Wang2019VaTeXAL,
  title={VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research},
  author={Xin Eric Wang and Jiawei Wu and Junkun Chen and Lei Li and Y. Wang and William Yang Wang},
  booktitle={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  pages={4580--4590}
}
We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions, there are over 206,000 English-Chinese parallel translation pairs. Compared to the widely used MSR-VTT dataset, VATEX is multilingual, larger, more linguistically complex, and more diverse in terms of both video and natural language descriptions. We also introduce two tasks for video-and-language research based on VATEX…
59 Citations
MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish
Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models
Violin: A Large-Scale Dataset for Video-and-Language Inference
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
  • Mingyang Zhou, Luowei Zhou, +4 authors Jingjing Liu
  • Computer Science
  • ArXiv
  • 2021
RUC_AIM3 at TRECVID 2019: Video to Text
Multi-attention mechanism for Chinese description of videos
  • Hu Liu, Junxiu Wu, Jiabin Yuan
  • 2020
Hybrid Space Learning for Language-based Video Retrieval
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review

References

SHOWING 1-10 OF 76 REFERENCES
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Fluency-Guided Cross-Lingual Image Captioning
Localizing Moments in Video with Natural Language
Multi30K: Multilingual English-German Image Descriptions
Sequence to Sequence -- Video to Text
Incorporating Global Visual Features into Attention-based Neural Machine Translation
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
Movie Description
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models