Ego4D: Around the World in 3,000 Hours of Egocentric Video
@article{Grauman2021Ego4DAT, title={Ego4D: Around the World in 3,000 Hours of Egocentric Video}, author={Kristen Grauman and Andrew Westbury and Eugene Byrne and Zachary Q. Chavis and Antonino Furnari and Rohit Girdhar and Jackson Hamburger and Hao Jiang and Miao Liu and Xingyu Liu and Miguel Martin and Tushar Nagarajan and Ilija Radosavovic and Santhosh K. Ramakrishnan and Fiona Ryan and Jayant Sharma and Michael Wray and Mengmeng Xu and Eric Z. Xu and Chen Zhao and Siddhant Bansal and Dhruv Batra and Vincent Cartillier and Sean Crane and Tien Do and Morrie Doulaty and Akshay Erapalli and Christoph Feichtenhofer and Adriano Fragomeni and Qichen Fu and Christian Fuegen and Abrham Gebreselasie and Cristina Gonz{\'a}lez and James M. Hillis and Xuhua Huang and Yifei Huang and Wenqi Jia and Weslie Yu Heng Khoo and J{\'a}chym Kol{\'a}r and Satwik Kottur and Anurag Kumar and Federico Landini and Chao Li and Yanghao Li and Zhenqiang Li and Karttikeya Mangalam and Raghava Modhugu and Jonathan Munro and Tullie Murrell and Takumi Nishiyasu and Will Price and Paola Ruiz Puentes and Merey Ramazanova and Leda Sari and Kiran K. Somasundaram and Audrey Southerland and Yusuke Sugano and Ruijie Tao and Minh Vo and Yuchen Wang and Xindi Wu and Takuma Yagi and Yunyi Zhu and Pablo Arbel{\'a}ez and David J. Crandall and Dima Damen and Giovanni Maria Farinella and Bernard Ghanem and Vamsi Krishna Ithapu and C. V. Jawahar and Hanbyul Joo and Kris Kitani and Haizhou Li and Richard A. Newcombe and Aude Oliva and Hyun Soo Park and James M. Rehg and Yoichi Sato and Jianbo Shi and Mike Zheng Shou and Antonio Torralba and Lorenzo Torresani and Mingfei Yan and Jitendra Malik}, journal={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2021}, pages={18973-18990} }
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically…
149 Citations
4D Human Body Capture from Egocentric Video via 3D Scene Grounding
- Computer Science, 2021 International Conference on 3D Vision (3DV)
- 2021
This work proposes a simple yet effective optimization-based approach that leverages 2D observations of the entire video sequence and human-scene interaction constraints to estimate second-person human poses, shapes, and global motion grounded in the 3D environment captured from the egocentric view.
EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices
- Computer Science, ArXiv
- 2021
This work presents EgoBody, a novel large-scale dataset for social interactions in complex 3D scenes, collecting 68 sequences that span diverse sociological interaction categories, and proposes the first benchmark for 3D full-body pose and shape estimation from egocentric views.
NeuralDiff: Segmenting 3D objects that move in egocentric videos
- Computer Science, 2021 International Conference on 3D Vision (3DV)
- 2021
It is demonstrated that the method can successfully separate the different types of motion, outperforming recent neural rendering baselines at this task, and can accurately segment the moving objects.
E2(GO)MOTION: Motion Augmented Event Stream for Egocentric Action Recognition
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This paper introduces N-EPIC-Kitchens, the first event-based camera extension of the large-scale EPIC-Kitchens dataset, and proposes two strategies: directly processing event-camera data with traditional video-processing architectures (E2(GO)) and using event data to distill optical flow information (E2(GO)MO).
Shaping embodied agent behavior with activity-context priors from egocentric video
- Computer Science
- 2021
This work introduces an approach to discover activity-context priors from in-the-wild egocentric video captured with human-worn cameras, encoding the video-based prior as an auxiliary reward function that encourages an agent to bring compatible objects together before attempting an interaction.
Sustainable AI: Environmental Implications, Challenges and Opportunities
- Computer Science, MLSys
- 2022
The carbon footprint of AI computing is characterized by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware.
AVATAR submission to the Ego4D AV Transcription Challenge
- Computer Science, ArXiv
- 2022
This pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images; it achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7% and winning the challenge.
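The metric referenced above is word error rate (WER). As a quick illustration only, and not the challenge's official scorer, WER is the word-level edit distance between the hypothesis and the reference transcript divided by the reference length; a minimal Python sketch:

```python
# Minimal WER sketch: word-level Levenshtein distance divided by the
# reference length. Illustrative only; not the Ego4D challenge scorer.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("wash the dishes now", "wash dishes how"))  # 0.5 (one deletion, one substitution)
```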
Where a Strong Backbone Meets Strong Features - ActionFormer for Ego4D Moment Queries Challenge
- Computer Science, ArXiv
- 2022
This report describes the submission to the Ego4D Moment Queries Challenge 2022, which builds on ActionFormer, the state-of-the-art backbone for temporal action localization, and a trio of strong video features from SlowFast, Omnivore and EgoVLP.
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- Computer Science, ArXiv
- 2022
This work shows that model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue – in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without requiring finetuning.
Estimating more camera poses for ego-centric videos is essential for VQ3D
- Computer Science, ArXiv
- 2022
A new pipeline is designed for the challenging egocentric video camera pose estimation problem, and the current VQ3D framework is revisited and optimized in terms of performance and efficiency.
References
SHOWING 1-10 OF 221 REFERENCES
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
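To make the idea of a joint text-video embedding concrete, here is a minimal PyTorch sketch that projects pre-extracted clip and narration features into a shared space and trains them with a symmetric contrastive loss over matched pairs. The feature dimensions, encoders, and loss choice are illustrative assumptions, not the HowTo100M recipe (which learns its embedding with a ranking-style loss over clip-narration pairs).

```python
# Minimal joint text-video embedding sketch (illustrative assumptions,
# not the HowTo100M architecture or training setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # pooled clip features -> shared space
        self.text_proj = nn.Linear(text_dim, embed_dim)    # pooled narration features -> shared space

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def contrastive_loss(v, t, temperature=0.07):
    logits = v @ t.T / temperature                      # pairwise clip-narration similarities
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy batch of 8 pre-extracted clip and narration feature vectors.
v, t = JointEmbedding()(torch.randn(8, 2048), torch.randn(8, 768))
contrastive_loss(v, t).backward()
```

At retrieval time, text queries are embedded the same way and clips are ranked by cosine similarity in the shared space.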
Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos
- Computer Science, ArXiv
- 2018
Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.
The Kinetics Human Action Video Dataset
- Computer Science, ArXiv
- 2017
The dataset is described along with its statistics and how it was collected, and some baseline performance figures are given for neural network architectures trained and tested for human action classification on this dataset.
Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
The "Ego-Exo" framework can be seamlessly integrated into standard video models; it outperforms all baselines when fine-tuned for egocentric activity recognition, achieving state-of-the-art results on Charades-Ego and EPIC-Kitchens-100.
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work proposes a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets, and demonstrates the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects.
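To illustrate what binding modalities within a temporal window can look like, the sketch below concatenates per-modality features drawn from the same window (possibly at different offsets), classifies each window, and averages the predictions. The feature dimensions, fusion head, and class count are assumptions for illustration, not the EPIC-Fusion (TBN) architecture.

```python
# Minimal mid-level temporal-binding sketch (illustrative assumptions,
# not the EPIC-Fusion/TBN architecture).
import torch
import torch.nn as nn

class TemporalBindingHead(nn.Module):
    def __init__(self, rgb_dim=1024, flow_dim=1024, audio_dim=512, num_classes=97):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(rgb_dim + flow_dim + audio_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, rgb, flow, audio):
        # rgb/flow/audio: (batch, windows, dim) features sampled from the
        # same temporal binding windows, possibly at different offsets.
        fused = self.fusion(torch.cat([rgb, flow, audio], dim=-1))  # fuse within each window
        return fused.mean(dim=1)                                    # average window predictions

head = TemporalBindingHead()
scores = head(torch.randn(4, 3, 1024), torch.randn(4, 3, 1024), torch.randn(4, 3, 512))
print(scores.shape)  # torch.Size([4, 97])
```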
You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
It is shown that since interactions between individuals often induce a well-ordered series of back-and-forth responses, it is possible to learn a temporal model of the interlinked poses even though one party is largely out of view.
In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video
- Computer Science, ECCV
- 2018
A novel deep model is proposed for joint gaze estimation and action recognition in First Person Vision that describes the participant’s gaze as a probabilistic variable and models its distribution using stochastic units in a deep network to generate an attention map.
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
- Computer Science, ArXiv
- 2018
This paper introduces EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments; the participants narrated their own videos after recording, thus reflecting true intention, and ground truths were crowd-sourced based on these narrations.
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
- Computer Science, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
Social interactions: A first-person perspective
- Psychology, 2012 IEEE Conference on Computer Vision and Pattern Recognition
- 2012
Encouraging results on detection and recognition of social interactions in first-person videos captured from multiple days of experience in amusement parks are demonstrated.