Corpus ID: 235377363

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

@article{Li2021VALUEAM,
  title={VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation},
  author={Linjie Li and Jie Lei and Zhe Gan and Licheng Yu and Yen-Chun Chen and Rohith Krishnan Pillai and Yu Cheng and Luowei Zhou and Xin Wang and William Yang Wang and Tamara L. Berg and Mohit Bansal and Jingjing Liu and Lijuan Wang and Zicheng Liu},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.04632}
}
Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets of a single task. In reality, a truly useful VidL system is expected to generalize easily to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video… 
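As a rough illustration of how a multi-task VidL evaluation of this kind can be organized, the sketch below groups datasets by task family and computes a single aggregate score. This is a hypothetical Python sketch, not the official VALUE starter code; the TASK_GROUPS mapping, the placeholder dataset names, and the model_fn/meta_average helpers are all illustrative assumptions.

from typing import Callable, Dict, List

# Hypothetical grouping of evaluation datasets by task family (illustrative
# names only; the abstract above is truncated, so the third task family is
# intentionally left out here).
TASK_GROUPS: Dict[str, List[str]] = {
    "text_to_video_retrieval": ["retrieval_dataset_a", "retrieval_dataset_b"],
    "video_question_answering": ["qa_dataset_a", "qa_dataset_b"],
}

def evaluate_all(model_fn: Callable[[str, str], float]) -> Dict[str, float]:
    """Run one model over every (task, dataset) pair and collect per-dataset scores.

    model_fn(task, dataset) is assumed to return a single scalar metric,
    e.g. recall@1 for retrieval or accuracy for QA.
    """
    scores: Dict[str, float] = {}
    for task, datasets in TASK_GROUPS.items():
        for dataset in datasets:
            scores[f"{task}/{dataset}"] = model_fn(task, dataset)
    return scores

def meta_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over all datasets, one simple way to summarize a multi-task benchmark."""
    return sum(scores.values()) / len(scores)

A single aggregate score of this kind makes it easy to compare systems that must generalize across all tasks rather than excel at one, which is the evaluation setting the benchmark targets.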
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
TLDR
A multi-level alignment training scheme is proposed to directly shape the encoding process for video-and-language grounding tasks, achieving performance comparable to previous state of the art on multiple video QA and retrieval datasets.
Revisiting the "Video" in Video-Language Understanding
TLDR
The atemporal probe (ATP) is proposed, a new model for video-language analysis that provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding; effectively integrating ATP into full video-level temporal models is shown to improve both efficiency and state-of-the-art accuracy.
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
TLDR
The VLUE benchmark is introduced, a multi-task multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off (“Pareto SOTA”) of VLP models, and it is demonstrated that there is a sizable generalization gap for all VLP models when testing on out-of-distribution test sets annotated on images from a more diverse distribution that spreads across cultures.
A CLIP-Enhanced Method for Video-Language Understanding
TLDR
A CLIP-Enhanced method is proposed to incorporate image-text pretrained knowledge into downstream video-text tasks; it outperforms the state of the art by 2.4% on the VALUE benchmark.
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
TLDR
This work presents SwinBERT, an end-to-end transformer-based model for video captioning that takes video frame patches directly as inputs and outputs a natural language description; it can adapt to variable lengths of video input without dedicated design for different frame rates.
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
TLDR
A multimodal data augmentation technique is proposed which works in the feature space and creates new videos and captions by mixing semantically similar samples; it achieves considerable improvements over a baseline method and improves state-of-the-art performance, and is accompanied by multiple ablation studies.
Video Question Answering with Iterative Video-Text Co-Tokenization
TLDR
A novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
TLDR
This work builds on frozen bidirectional language models (BiLM) and shows that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA and demonstrates competitive performance in the few-shot and fully-supervised setting.
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
TLDR
Surprisingly, experimental results show that this unified VidL framework LAVENDER achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning.
Multimodal Learning with Transformers: A Survey
TLDR
A comprehensive survey of Transformer techniques oriented at multimodal data and a discussion of open problems and potential research directions for the community are presented.
...
...

References

SHOWING 1-10 OF 80 REFERENCES
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained and that, with an improved pretraining recipe, it can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
TLDR
This work presents a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese and demonstrates that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation.
Towards Automatic Learning of Procedures From Web Instructional Videos
TLDR
A segment-level recurrent network is proposed for generating procedure segments by modeling the dependencies across segments and it is shown that the proposed model outperforms competitive baselines in procedure segmentation.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
What Is More Likely to Happen Next? Video-and-Language Future Event Prediction
TLDR
This work collects a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips, and presents a strong baseline incorporating information from video, dialogue, and commonsense knowledge.
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
TLDR
The proposed XML model uses a late fusion design with a novel Convolutional Start-End detector (ConvSE), surpassing baselines by a large margin and with better efficiency, providing a strong starting point for future work.
TVQA: Localized, Compositional Video Question Answering
TLDR
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
TLDR
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training
TLDR
HERO, a novel framework for large-scale video+language omni-representation learning, is presented; it achieves new state of the art on multiple benchmarks over Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference and Video Captioning tasks across different domains.
...
...