The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)
@article{Albanie2020TheEA,
  title   = {The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)},
  author  = {Samuel Albanie and Yang Liu and Arsha Nagrani and Antoine Miech and Ernesto Coto and Ivan Laptev and Rahul Sukthankar and Bernard Ghanem and Andrew Zisserman and Valentin Gabeur and Chen Sun and Alahari Karteek and Cordelia Schmid and Shizhe Chen and Yida Zhao and Qin Jin and Kaixu Cui and Hui Liu and Chen Wang and Yudong Jiang and Xiaoshuai Hao},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2008.00744}
}
The organisers would like to express their gratitude to the creators of the original datasets used in this challenge, and in particular to thank Juan Carlos Niebles, Ranjay Krishna, Luowei Zhou, Lisa Ann Hendricks, Jun Xu, Tao Mei, Ting Yao, Yong Rui, David L. Chen, Bryan Russell and Anna Rohrbach for their assistance. They gratefully acknowledge the support of the Programme Grant Seebibyte EP/M013774/1.
5 Citations
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
- Computer Science · NeurIPS Datasets and Benchmarks
- 2021
The Video-and-Language Understanding Evaluation (VALUE) benchmark is introduced, an assemblage of 11 VidL datasets spanning three popular tasks: (i) text-to-video retrieval, (ii) video question answering, and (iii) video captioning. The benchmark promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.
Model-agnostic Multi-Domain Learning with Domain-Specific Adapters for Action Recognition
- Computer Science · IEICE Transactions on Information and Systems
- 2022
Experimental results on three popular action recognition datasets demonstrate that the proposed method is more effective than a multi-head architecture and more efficient than separately training models for each domain.
Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval
- Computer Science · ICMR
- 2021
A novel Multi-Feature Graph ATtention Network (MFGATN) is proposed, which enriches the representation of each video feature by interchanging high-level semantic information among them, together with a novel Dual Constraint Ranking Loss (DCRL), which simultaneously considers the inter-modal ranking constraint and the intra-modal structure constraint.
On Semantic Similarity in Video Retrieval
- Computer Science · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This paper proposes several proxies for estimating semantic similarity in large-scale retrieval datasets without additional annotations, and argues for a move to semantic-similarity video retrieval, where multiple videos/captions can be deemed equally relevant and their relative ranking does not affect a method's reported performance.
Video-aided Unsupervised Grammar Induction
- Computer Science · NAACL
- 2021
This paper investigates video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video, and proposes a Multi-Modal Compound PCFG model (MMC-PCFG), which outperforms each individual modality and previous state-of-the-art systems on three benchmarks.
References
Showing 1-10 of 36 references
Multi-modal Transformer for Video Retrieval
- Computer Science · ECCV
- 2020
A multi-modal transformer is presented that jointly encodes the different modalities in video, allowing each of them to attend to the others, within a novel framework that establishes state-of-the-art results for video retrieval on three datasets.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
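The soft-attention pooling mentioned in this summary can be pictured as a weighted average of per-frame features, with weights produced by a softmax over learned frame scores. The sketch below is only illustrative: the dot-product scoring function, shapes, and names are assumptions, not the paper's exact model.

```python
import numpy as np

def soft_attention_pool(frame_feats, query):
    """Illustrative soft-attention pooling over per-frame features.

    frame_feats: (T, D) array of frame-level features.
    query: (D,) vector used to score frames (a stand-in for whatever
    learned scorer a captioning model might use).
    """
    scores = frame_feats @ query                  # (T,) one score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over time
    return weights @ frame_feats                  # (D,) attention-weighted average

# Usage: pool 16 frames of 32-dimensional features into one vector.
rng = np.random.default_rng(0)
pooled = soft_attention_pool(rng.normal(size=(16, 32)), rng.normal(size=32))
print(pooled.shape)  # (32,)
```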
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Squeeze-and-Excitation Networks
- Computer Science · IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2020
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
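The channel recalibration described in this summary can be sketched compactly: global-average-pool each channel (squeeze), pass the pooled vector through a small bottleneck with a sigmoid (excitation), and rescale the channels by the resulting gates. The reduction ratio, shapes, and plain-NumPy formulation below are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def se_block(x, w1, w2):
    """Minimal Squeeze-and-Excitation sketch for a feature map x of shape (C, H, W).

    w1: (C // r, C) and w2: (C, C // r) are the excitation weights; the
    reduction ratio r = 16 used below is an illustrative choice.
    """
    c = x.shape[0]
    # Squeeze: global average pooling collapses spatial dims to one value per channel.
    z = x.mean(axis=(1, 2))                        # (C,)
    # Excitation: bottleneck MLP with ReLU then sigmoid produces per-channel gates.
    s = np.maximum(w1 @ z, 0.0)                    # (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))        # (C,)
    # Recalibration: rescale each channel of the input by its learned gate.
    return x * gates.reshape(c, 1, 1)

# Usage: a random 64-channel feature map with reduction ratio r = 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 14, 14))
w1 = rng.normal(size=(4, 64)) * 0.1
w2 = rng.normal(size=(64, 4)) * 0.1
print(se_block(x, w1, w2).shape)  # (64, 14, 14)
```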
Dense-Captioning Events in Videos
- Computer Science · 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
This work proposes a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, and introduces a new captioning module that uses contextual information from past and future events to jointly describe all events.
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
- Computer Science · ArXiv
- 2018
This work proposes a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training, and demonstrates significant improvements over previously reported methods on both text-to-video and video-to-text retrieval tasks.
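One way a mixture-of-embedding-experts model can tolerate missing inputs is to re-normalise the expert weights over only the modalities that are present, as in the rough sketch below. The weighting scheme, expert names, and shapes are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def mixture_of_experts_similarity(text_embs, video_embs, expert_logits):
    """Hedged sketch of MEE-style scoring with missing modalities.

    text_embs, video_embs: dicts mapping expert name -> unit-normalised vector;
    video_embs may omit experts that are missing for a given clip.
    expert_logits: dict mapping expert name -> scalar weight logit predicted
    from the text. All names and shapes here are illustrative assumptions.
    """
    available = [k for k in expert_logits if k in video_embs]
    if not available:
        return 0.0
    # Softmax over the *available* experts only, so missing modalities
    # contribute nothing and the remaining weights still sum to one.
    logits = np.array([expert_logits[k] for k in available])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Final similarity is the weighted sum of per-expert cosine similarities.
    sims = np.array([float(text_embs[k] @ video_embs[k]) for k in available])
    return float(weights @ sims)

# Usage: the "audio" expert is missing for this video, so it is ignored.
rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
text = {k: unit(rng.normal(size=8)) for k in ("appearance", "motion", "audio")}
video = {k: unit(rng.normal(size=8)) for k in ("appearance", "motion")}
logits = {"appearance": 0.5, "motion": 0.2, "audio": 1.0}
print(mixture_of_experts_similarity(text, video, logits))
```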
Use What You Have: Video retrieval using representations from collaborative experts
- Computer Science · BMVC
- 2019
This paper proposes a collaborative experts model to aggregate information from different pre-trained experts, and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.
Localizing Moments in Video with Natural Language
- Computer Science · 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
The Moment Context Network (MCN) is proposed which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
Learning multiple visual domains with residual adapters
- Computer Science · NIPS
- 2017
This paper develops a tunable deep network architecture that, by means of adapter residual modules, can be steered on the fly to diverse visual domains, and introduces the Visual Decathlon Challenge, a benchmark that evaluates the ability of representations to capture ten very different visual domains simultaneously and to recognise them uniformly well.
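The adapter idea summarised above can be pictured as a small domain-specific module added on top of a frozen shared layer, with a residual connection so the adapter only has to learn a domain correction. The sketch below uses a plain linear adapter on a feature vector; the shapes, domain names, and formulation are assumptions rather than the paper's exact design.

```python
import numpy as np

def adapted_layer(x, shared_weight, adapter_weight):
    """Minimal residual-adapter sketch on a feature vector x of shape (D,).

    shared_weight: (D, D) frozen, domain-agnostic weights.
    adapter_weight: (D, D) small domain-specific weights trained per domain.
    """
    h = shared_weight @ x              # frozen shared transformation
    return h + adapter_weight @ h      # add a small domain-specific residual

# Usage: steering the same shared backbone to a new domain swaps only the adapter.
rng = np.random.default_rng(0)
x = rng.normal(size=16)
shared = rng.normal(size=(16, 16)) * 0.1
adapters = {"flowers": rng.normal(size=(16, 16)) * 0.01,
            "aircraft": rng.normal(size=(16, 16)) * 0.01}
print(adapted_layer(x, shared, adapters["flowers"]).shape)  # (16,)
```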