The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

  title={The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines},
  author={Dima Damen and Hazel Doughty and Giovanni Maria Farinella and Sanja Fidler and Antonino Furnari and Evangelos Kazakos and Davide Moltisanti and Jonathan Munro and Toby Perrett and Will Price and Michael Wray},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people’s interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict nonscripted daily activities, as recording is started every time a participant… 

Symbiotic Attention: UTS-Baidu Submission to the EPIC-Kitchens 2020 Action Recognition Challenge

The model ranked first on both the seen and unseen test sets of the EPIC-Kitchens Action Recognition Challenge 2020, and incorporates multiple input modalities, i.e., RGB frames and optical flow, to further improve performance through multi-modal fusion.

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Ego4D, a massive-scale egocentric video dataset and benchmark suite, is introduced and a host of new benchmark challenges centered around understanding the first-person visual experience in the past, present, and future are presented.

The MECCANO Dataset: Understanding Human-Object Interactions from Egocentric Videos in an Industrial-like Domain

The MECCANO dataset is introduced, the first dataset of egocentric videos for studying human-object interactions in industrial-like settings, and it is used to study a revisited version of the standard human-object interaction detection task.

Egocentric Video Task Translation

This work proposes EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once, shows its advantages over existing transfer paradigms, and achieves top-ranked results on four of the Ego4D 2022 benchmark challenges.

SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos

This work proposes SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production, and extends current tasks in the realm of soccer to include action spotting, camera shot segmentation with boundary detection, and a novel replay grounding task.

Egocentric Activity Recognition and Localization on a 3D Map

This work addresses the challenging problem of jointly recognizing and localizing actions of a mobile user on a known 3D map from egocentric videos, and proposes a novel deep probabilistic model that takes as input a Hierarchical Volumetric Representation of the 3D environment and an egocentric video, infers the 3D action location as a latent variable, and recognizes the action based on the video and contextual cues surrounding its potential locations.

Ego-Only: Egocentric Action Detection without Exocentric Pretraining

Ego-Only, the first training pipeline that enables state-of-the-art action detection on egocentric videos without any form of exocentric (third-person) pretraining, is presented.

FBK-HUPBA Submission to the EPIC-Kitchens Action Recognition 2020 Challenge

The technical details of the submission to the EPIC-Kitchens Action Recognition 2020 Challenge are described, and an ensemble of GSM and EgoACO model families with different backbones and pre-training is designed to generate the prediction scores.

Multi-modal action segmentation in the kitchen with a feature fusion approach

This paper builds an original dataset with frame-level annotations, examines the usefulness of multi-modal features for action segmentation, and analyzes the effect of each modality using three evaluation metrics.

Towards Streaming Egocentric Action Anticipation

A lightweight action anticipation model consisting of a simple feed-forward 3D CNN is proposed and optimized using knowledge distillation techniques and a custom loss, and the proposed approach is shown to outperform prior art in the streaming scenario, also in combination with other lightweight models.



Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

This work proposes a novel Hollywood in Homes approach to data collection, yielding a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities, and evaluates and provides baseline results for several tasks including action recognition and automatic description generation.

Detecting activities of daily living in first-person camera views

This work presents a novel dataset and novel algorithms for the problem of detecting activities of daily living in first-person camera views, and develops novel representations including temporal pyramids and composite object models that exploit the fact that objects look different when being interacted with.

SLAC: A Sparsely Labeled Dataset for Action Classification and Localization

The proposed procedure dramatically reduces the amount of human labeling by automatically identifying hard clips, i.e., clips that contain coherent actions but lead to prediction disagreement between action classifiers, thus generating labels for highly informative samples at little cost.

From Lifestyle Vlogs to Everyday Interactions

This work starts with a large collection of interaction-rich video data and then annotates and analyzes it, using Internet Lifestyle Vlogs as the source of surprisingly large and diverse interaction data.

Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video

It is demonstrated that disagreement stems from a limited understanding of the distinct phases of an action, and annotating based on the Rubicon Boundaries, inspired by a similarly named cognitive model, is proposed for consistent temporal bounds of object interactions.

Ego-Topo: Environment Affordances From Egocentric Video

A model for environment affordances that is learned directly from egocentric video is introduced, to gain a human-centric model of a physical space that captures the primary spatial zones of interaction and the likely activities they support.

Action Recognition From Single Timestamp Supervision in Untrimmed Videos

This work proposes a method supervised by single timestamps located around each action instance in untrimmed videos, which replaces expensive action bounds with sampling distributions initialised from these timestamps, and demonstrates that these distributions converge to the location and extent of discriminative action segments.

Moments in Time Dataset: One Million Videos for Event Understanding

The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.

HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

On HACS Segments, the state-of-the-art methods of action proposal generation and action localization are evaluated, and the new challenges posed by the dense temporal annotations are highlighted.