• Corpus ID: 220936376

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Alahari Karteek, Cordelia Schmid, Shizhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao

The organisers would like to express their gratitude to the creators of the original datasets used in this challenge. They would like to thank in particular Juan Carlos Niebles, Ranjay Krishna, Luowei Zhou, Lisa Ann Hendricks, Jun Xu, Tao Mei, Ting Yao, Yong Rui, David L. Chen, Bryan Russell and Anna Rohrbach for their assistance. We gratefully acknowledge the support of the Programme Grant Seebibyte EP/M013774/1.

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

The Video-And-Language Understanding Evaluation (VALUE) benchmark is introduced: an assemblage of 11 VidL datasets over 3 popular tasks, namely (i) text-to-video retrieval, (ii) video question answering, and (iii) video captioning. The benchmark promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.

Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval

A novel Multi-Feature Graph ATtention Network (MFGATN) is proposed, which enriches the representation of each feature in videos through the interchange of high-level semantic information among them, together with a novel Dual Constraint Ranking Loss (DCRL), which simultaneously considers the inter-modal ranking constraint and the intra-modal structure constraint.

On Semantic Similarity in Video Retrieval

This paper proposes several proxies to estimate semantic similarities in large-scale retrieval datasets, without additional annotations, and proposes a move to semantic similarity video retrieval, where multiple videos/captions can be deemed equally relevant, and their relative ranking does not affect a method’s reported performance.

Video-aided Unsupervised Grammar Induction

This paper investigates video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video, and proposes a Multi-Modal Compound PCFG model (MMC-PCFG), which outperforms each individual modality and previous state-of-the-art systems on three benchmarks.

Model-agnostic Multi-Domain Learning with Domain-Specific Adapters for Action Recognition

Experimental results on three popular action recognition datasets demonstrate that the proposed method is more effective than a multi-head architecture and more efficient than separately training models for each domain.



End-to-End Learning of Visual Representations From Uncurated Instructional Videos

This work proposes a new learning approach, MIL-NCE, that is capable of addressing misalignments inherent in narrated videos and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

Multi-modal Transformer for Video Retrieval

A multi-modal transformer is proposed to jointly encode the different modalities in video, allowing each of them to attend to the others, along with a novel framework that establishes state-of-the-art results for video retrieval on three datasets.

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

Squeeze-and-Excitation Networks

This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.

Dense-Captioning Events in Videos

This work proposes a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, and introduces a new captioning module that uses contextual information from past and future events to jointly describe all events.

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

This work proposes a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training, and demonstrates significant improvements over previously reported methods on both text-to-video and video-to-text retrieval tasks.

Use What You Have: Video retrieval using representations from collaborative experts

This paper proposes a collaborative experts model to aggregate information from different pre-trained experts and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.

Localizing Moments in Video with Natural Language

The Moment Context Network (MCN) is proposed, which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.

Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning

A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.