Ego4D: Around the World in 3,000 Hours of Egocentric Video

Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Q. Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh K. Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Z. Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham Gebreselasie, Cristina González, James M. Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Yu Heng Khoo, Jáchym Kolár, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran K. Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, Pablo Arbeláez, David J. Crandall, Dima Damen, Giovanni Maria Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard A. Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik
Venue: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically…

4D Human Body Capture from Egocentric Video via 3D Scene Grounding

This work proposes a simple yet effective optimization-based approach that leverages 2D observations of the entire video sequence and human-scene interaction constraints to estimate second-person human poses, shapes, and global motion grounded in the 3D environment captured from the egocentric view.

EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices

This work introduces EgoBody, a novel large-scale dataset for social interactions in complex 3D scenes. It collects 68 sequences spanning diverse sociological interaction categories and proposes the first benchmark for 3D full-body pose and shape estimation from egocentric views.

NeuralDiff: Segmenting 3D objects that move in egocentric videos

It is demonstrated that the method can successfully separate the different types of motion, outperforming recent neural rendering baselines at this task, and can accurately segment the moving objects.

E2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition

This paper introduces N-EPIC-Kitchens, the first event-based camera extension of the large-scale EPIC-Kitchens dataset, and proposes two strategies: directly processing event-camera data with traditional video-processing architectures (E2(GO)) and using event data to distill optical flow information (E2(GO)MO).

Shaping embodied agent behavior with activity-context priors from egocentric video

This work introduces an approach to discover activity-context priors from in-the-wild egocentric video captured with human-worn cameras, encoding the video-based prior as an auxiliary reward function that encourages an agent to bring compatible objects together before attempting an interaction.

Sustainable AI: Environmental Implications, Challenges and Opportunities

The carbon footprint of AI computing is characterized by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware.

AVATAR submission to the Ego4D AV Transcription Challenge

This pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images. It achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7% and winning the challenge.

Where a Strong Backbone Meets Strong Features - ActionFormer for Ego4D Moment Queries Challenge

This report describes the submission to the Ego4D Moment Queries Challenge 2022, which builds on ActionFormer, the state-of-the-art backbone for temporal action localization, and a trio of strong video features from SlowFast, Omnivore and EgoVLP.

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

This work shows that model diversity is symbiotic and can be leveraged to build AI systems with structured Socratic dialogue, in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning.

Estimating more camera poses for ego-centric videos is essential for VQ3D

A new pipeline is designed for the challenging egocentric video camera pose estimation problem and the current VQ3D framework is revisited and optimized in terms of performance and efficiency.



HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.

The Kinetics Human Action Video Dataset

The dataset, its statistics, and how it was collected are described, and some baseline performance figures are given for neural network architectures trained and tested for human action classification on this dataset.

Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

The "Ego-Exo" framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

This work proposes a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets, and demonstrates the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects.

You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions

It is shown that since interactions between individuals often induce a well-ordered series of back-and-forth responses, it is possible to learn a temporal model of the interlinked poses even though one party is largely out of view.

In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video

A novel deep model is proposed for joint gaze estimation and action recognition in First Person Vision that describes the participant’s gaze as a probabilistic variable and models its distribution using stochastic units in a deep network to generate an attention map.

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

This paper introduces EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. The participants narrated their own videos after recording, thus reflecting true intention, and ground truths were crowd-sourced based on these narrations.

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.

Social interactions: A first-person perspective

Encouraging results on detection and recognition of social interactions in first-person videos captured from multiple days of experience in amusement parks are demonstrated.