SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos

Adrien Deliège, Anthony Cioppa, Silvio Giancola, Meisam Jamshidi Seikavandi, Jacob Velling Dueholm, Kamal Nasrollahi, Bernard Ghanem, Thomas B. Moeslund, Marc Van Droogenbroeck
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Understanding broadcast videos is a challenging task in computer vision, as it requires generic reasoning capabilities to appreciate the content offered by the video editing. In this work, we propose SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet [24] video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production. Specifically, we release around 300k annotations within SoccerNet’s 500 untrimmed broadcast…
A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games
This work presents a universal taxonomy covering a wide range of low- and high-level events for invasion games, refined by way of example for soccer and handball, and releases two multi-modal datasets comprising video and positional data with gold-standard annotations to foster research in fine-grained and ball-centered event spotting.
Temporally-Aware Feature Pooling for Action Spotting in Soccer Broadcasts
This work proposes a novel feature pooling method based on NetVLAD, dubbed NetVLAD++, that embeds temporally-aware knowledge by splitting the context into the parts before and after an action occurs, and argues that treating the contextual information around the action spot as a single entity leads to sub-optimal learning for the pooling module.
Camera Calibration and Player Localization in SoccerNet-v2 and Investigation of their Representations for Action Spotting
This work distills a powerful commercial calibration tool into a recent neural network architecture on the large-scale SoccerNet dataset, composed of untrimmed broadcast videos of 500 soccer games, and leverages it to provide three ways of representing the calibration results along with player localization.
Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection
This tech report presents a two-stage paradigm to detect which events happen, and when, in soccer broadcast videos: multiple action recognition models are fine-tuned on soccer data to extract high-level semantic features, and a transformer-based temporal detection module is designed to locate the target events.
Video Action Understanding
This tutorial introduces and systematizes fundamental topics, basic concepts, and notable examples in supervised video action understanding. It clarifies a taxonomy of action problems, catalogs and highlights video datasets, and formalizes domain-specific metrics to baseline proposed solutions.


SoccerDB: A Large-Scale Database for Comprehensive Video Understanding
This paper proposes a new soccer video database named SoccerDB, comprising 171,191 video segments from 346 high-quality soccer games, which is the largest database for comprehensive sports video understanding on various aspects.
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
This paper introduces SoccerNet, a benchmark for action spotting in soccer videos, and shows that the best model for classifying temporal segments of length one minute reaches a mean Average Precision (mAP) of 67.8%.
Improved Soccer Action Spotting using both Audio and Video Streams
This work uses the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues, and evaluates several ways to integrate the audio stream into video-only architectures.
A Context-Aware Loss Function for Action Spotting in Soccer Videos
This paper proposes a novel loss function that specifically considers the temporal context naturally present around each action, rather than focusing on the single annotated frame to spot, and demonstrates the generalization capability of this loss for generic activity proposals and detection on ActivityNet.
Shot Classification and Replay Detection in Broadcast Soccer Video
This work classifies the frames of a broadcast soccer video into four classes, namely long shot, medium shot, close shot, and logo frame, and proposes a model to detect replays within a soccer video.
Event detection in coarsely annotated sports videos via parallel multi receptive field 1D convolutions
This work introduces a multi-tower temporal convolutional network architecture for the task of event detection in coarsely annotated videos and demonstrates the effectiveness of the multi-receptive field architecture through appropriate ablation studies.
Soccer: Who Has the Ball? Generating Visual Analytics and Player Statistics
This paper proposes an approach that automatically generates visual analytics from soccer videos to help coaches and recruiters identify the most promising talents, and compares it with state-of-the-art approaches.
Bayesian Network-Based Customized Highlight Generation for Broadcast Soccer Videos
The proposed system for automatically generating highlights from sports TV broadcasts detects exciting clips based on audio features and then classifies the individual scenes within each clip into events such as replay, player, referee, spectator, and players gathering.
Shot Classification in Broadcast Soccer Video
This work presents an effective hierarchical shot classification scheme for broadcast soccer video. It partitions a video into replay and non-replay shots using replay logo detection, then classifies shots into Long, Medium, Close-up, or Out-field types using color and texture features in a decision tree.
The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines
This paper details how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions, and introduces new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions.