Moments in Time Dataset: One Million Videos for Event Understanding

@article{Monfort2020MomentsIT,
  title={Moments in Time Dataset: One Million Videos for Event Understanding},
  author={Mathew Monfort and Bolei Zhou and Sarah Adel Bargal and Alex Andonian and Tom Yan and Kandan Ramakrishnan and Lisa M. Brown and Quanfu Fan and Dan Gutfreund and Carl Vondrick and Aude Oliva},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2020},
  volume={42},
  pages={502-508}
}
We present the Moments in Time Dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds. Modeling the spatial-audio-temporal dynamics even for actions occurring in 3-second videos poses many challenges: meaningful events do not include only people, but also objects, animals, and natural phenomena; visual and auditory events can be symmetrical in time (“opening” is “closing” in reverse), and either transient or…
Citations

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding
TLDR
This work augments the existing video dataset, Moments in Time, to include over two million action labels for over one million three-second videos, and introduces novel challenges on how to train and analyze models for multi-action detection.
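The central modelling change behind multi-action labels is moving from a single softmax class per clip to independent per-class sigmoids. Below is a minimal PyTorch sketch of that multi-label setup; it is purely illustrative, not the authors' training code, and the class count and label indices are assumptions.

```python
# Minimal multi-label setup: each clip can be positive for several actions at
# once, so binary cross-entropy over per-class sigmoids replaces the softmax loss.
import torch
import torch.nn as nn

num_classes = 313                        # assumed vocabulary size, for illustration only
logits = torch.randn(8, num_classes)     # model outputs for a batch of 8 clips
targets = torch.zeros(8, num_classes)
targets[0, [3, 57, 201]] = 1.0           # hypothetical clip labeled with three actions
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(loss.item())
```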
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
TLDR
The Spoken Moments (S-MiT) dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events, is presented, together with a novel Adaptive Mean Margin (AMM) approach to contrastive learning; the models are evaluated on video/caption retrieval across multiple datasets.
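For context, the sketch below shows a generic margin-based cross-modal retrieval loss for paired video and caption embeddings. It is not the paper's Adaptive Mean Margin formulation, only the standard contrastive setup such an approach refines; the embedding size, batch size, and margin value are assumptions.

```python
# Generic cross-modal margin loss: matched video/caption pairs sit on the
# diagonal of the similarity matrix; mismatched pairs are pushed below them.
import torch
import torch.nn.functional as F

def margin_retrieval_loss(video_emb, text_emb, margin=0.2):
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    sim = v @ t.T                                    # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                    # matched pairs on the diagonal
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)
    # hinge against every mismatched caption (rows) and mismatched video (columns)
    loss_v2t = F.relu(margin + sim - pos)[off_diag].mean()
    loss_t2v = F.relu(margin + sim.T - pos)[off_diag].mean()
    return loss_v2t + loss_t2v

loss = margin_retrieval_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```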
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
  • C. Gu, Chen Sun, +8 authors J. Malik
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset
TLDR
This work uses state-of-the-art techniques for visual, auditory, and spatio-temporal localization and develops a method to accurately classify the activities in the Moments in Time dataset, providing a significant improvement over the baseline TRN model.
A Large Scale Multi-Label Action Dataset for Video Understanding
TLDR
A multi-label extension to the Moments in Time Dataset is presented which includes annotation of multiple actions in each video, and a baseline analysis is performed to compare the recognition results, class selectivity, and network robustness of a temporal relation network (TRN) trained on both the single-label Moments in Time dataset and the proposed multi-label extension.
Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
TLDR
This paper identifies action classes where temporal information is actually necessary for recognition, calls these "temporal classes", and proposes a methodology based on a simple and effective human annotation experiment that leads to better generalization on unseen classes, demonstrating the need for more temporal data.
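The paper's experiment is a human annotation study; the sketch below is only an analogous, model-based probe of the same idea: compare predictions on ordered versus temporally shuffled frames, and treat classes whose accuracy collapses under shuffling as candidates for "temporal classes". The model interface and tensor layout are assumptions.

```python
import torch

def shuffle_time(clip):                       # clip: (C, T, H, W)
    perm = torch.randperm(clip.shape[1])
    return clip[:, perm]

@torch.no_grad()
def order_sensitivity(model, clips, labels):  # clips: (N, C, T, H, W)
    ordered = model(clips).argmax(1)
    shuffled = model(torch.stack([shuffle_time(c) for c in clips])).argmax(1)
    acc_ordered = (ordered == labels).float().mean()
    acc_shuffled = (shuffled == labels).float().mean()
    # a large gap suggests the classes in question genuinely depend on temporal order
    return (acc_ordered - acc_shuffled).item()
```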
VideoGraph: Recognizing Minutes-Long Human Activities in Videos
TLDR
The graph, with its nodes and edges, is learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation, and it is demonstrated that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
STAGE: Spatio-Temporal Attention on Graph Entities for Video Action Detection
TLDR
A high-level video understanding module is developed which encodes interactions between actors and objects in both space and time, and which outperforms or matches state-of-the-art models that require heavy end-to-end, synchronized training on multiple GPUs.
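A rough sketch of the general pattern such a module builds on: detected actor/object features plus a simple space-time position encoding, fed through multi-head self-attention so every entity can attend to every other across frames. This is a generic stand-in, not the STAGE architecture itself; the feature dimension, encoding, and head count are assumptions.

```python
# Generic attention over detected entities with a learned space-time encoding.
import torch
import torch.nn as nn

class EntityAttentionSketch(nn.Module):
    def __init__(self, feat_dim=256, heads=4):
        super().__init__()
        self.pos = nn.Linear(5, feat_dim)            # (x, y, w, h, t): box + frame index
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, feats, boxes_t):
        # feats: (batch, num_entities, feat_dim)  detector features for actors/objects
        # boxes_t: (batch, num_entities, 5)       normalised box geometry + frame index
        x = feats + self.pos(boxes_t)
        out, _ = self.attn(x, x, x)                  # entities attend to each other
        return out

m = EntityAttentionSketch()
print(m(torch.randn(2, 12, 256), torch.rand(2, 12, 5)).shape)  # torch.Size([2, 12, 256])
```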
MEVA: A Large-Scale Multiview, Multimodal Video Dataset for Activity Detection
TLDR
This work presents the Multiview Extended Video with Activities (MEVA) dataset, a new, very large-scale dataset for human activity recognition, scripted to include diverse, simultaneous activities along with spontaneous background activity.
Trajectory Convolution for Action Recognition
TLDR
This work proposes a new CNN architecture, TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the existing temporal convolution.
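The sketch below illustrates the underlying idea under simplifying assumptions (precomputed flow, a fixed three-tap temporal kernel); it is not the TrajectoryNet operator itself. Instead of aggregating features at the same spatial location across frames, neighbouring frames are first warped along motion offsets so that temporal aggregation follows the moving content.

```python
# Warp neighbouring feature maps along (precomputed) motion offsets, then
# aggregate temporally, so the "temporal convolution" follows trajectories.
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Bilinearly sample `feat` (N, C, H, W) at positions displaced by `flow` (N, 2, H, W)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)       # (1, 2, H, W)
    coords = base + flow                                           # displaced sampling positions
    coords[:, 0] = 2 * coords[:, 0] / (w - 1) - 1                  # normalise x to [-1, 1]
    coords[:, 1] = 2 * coords[:, 1] / (h - 1) - 1                  # normalise y to [-1, 1]
    return F.grid_sample(feat, coords.permute(0, 2, 3, 1), align_corners=True)

def trajectory_aggregate(feats, flows, kernel=(0.25, 0.5, 0.25)):
    """feats: (T, N, C, H, W); flows[t]: offsets from frame t to frame t+1."""
    out = []
    for t in range(1, feats.shape[0] - 1):
        prev = warp(feats[t - 1], -flows[t - 1])   # rough backward warp of t-1 onto frame t
        nxt = warp(feats[t + 1], flows[t])         # warp of t+1 onto frame t
        out.append(kernel[0] * prev + kernel[1] * feats[t] + kernel[2] * nxt)
    return torch.stack(out)                        # (T-2, N, C, H, W)

feats = torch.randn(5, 1, 64, 28, 28)              # (T, N, C, H, W) feature maps
flows = torch.zeros(4, 1, 2, 28, 28)               # zero flow degenerates to plain averaging
print(trajectory_aggregate(feats, flows).shape)    # torch.Size([3, 1, 64, 28, 28])
```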

References

Showing 1-10 of 75 references
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
  • C. Gu, Chen Sun, +8 authors J. Malik
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.
Temporal Relational Reasoning in Videos
TLDR
This paper introduces an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales.
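A minimal PyTorch sketch of the multi-scale temporal relation idea follows; it illustrates the mechanism rather than reproducing the released TRN code. Per-frame features from ordered frame tuples at several scales are passed through small per-scale MLPs and the resulting logits are summed. Frame count, feature width, and tuple selection are simplified assumptions.

```python
# Multi-scale temporal relations: ordered k-frame tuples -> per-scale MLP -> summed logits.
import itertools
import torch
import torch.nn as nn

class TemporalRelationSketch(nn.Module):
    def __init__(self, feat_dim=256, num_frames=8, num_classes=339, hidden=256):
        super().__init__()
        self.num_frames = num_frames
        self.scales = list(range(2, num_frames + 1))       # pairs, triples, ... all frames
        self.mlps = nn.ModuleList([
            nn.Sequential(
                nn.Linear(k * feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, num_classes),
            )
            for k in self.scales
        ])

    def forward(self, frame_feats):                         # (batch, num_frames, feat_dim)
        logits = 0
        for k, mlp in zip(self.scales, self.mlps):
            # use a few ordered k-frame tuples (the real TRN samples them randomly)
            tuples = list(itertools.combinations(range(self.num_frames), k))[:3]
            for idx in tuples:
                x = frame_feats[:, list(idx), :].flatten(1)  # (batch, k * feat_dim)
                logits = logits + mlp(x)
        return logits

feats = torch.randn(4, 8, 256)                 # e.g. per-frame CNN features
print(TemporalRelationSketch()(feats).shape)   # torch.Size([4, 339])
```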
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set, a large-scale dataset of manually annotated audio events, is described; the dataset endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
Learning realistic human actions from movies
TLDR
A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.
The Open World of Micro-Videos
TLDR
A novel dataset of micro-videos labeled with 58 thousand tags is analyzed, and viewpoint-specific and temporally evolving models for video understanding are introduced, defined over state-of-the-art motion and deep visual features.
The Kinetics Human Action Video Dataset
TLDR
The Kinetics dataset is described, along with its statistics and how it was collected, and baseline performance figures are given for neural network architectures trained and tested for human action classification on this dataset.
ActivityNet: A large-scale video benchmark for human activity understanding
TLDR
This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
TLDR
A new Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation, is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
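The "inflation" step itself is simple enough to sketch: a pretrained 2D kernel is repeated along a new temporal axis and rescaled so that a video of identical frames reproduces the 2D network's activations. The helper below is an illustrative PyTorch sketch, not the paper's implementation; the temporal kernel size is an assumption.

```python
# Inflate a pretrained 2D convolution into a 3D one by repeating its kernel in
# time and dividing by the temporal kernel size.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w2d = conv2d.weight                                        # (out, in, kH, kW)
        w3d = w2d.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv3d = inflate_conv2d(conv2d)
video = torch.randn(1, 3, 8, 224, 224)             # (batch, channels, time, H, W)
print(conv3d(video).shape)                         # torch.Size([1, 64, 8, 112, 112])
```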
Actions as space-time shapes
TLDR
The method is fast, does not require video alignment, and is applicable in many scenarios where the background is known; its robustness to partial occlusions, non-rigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action, and low-quality video is demonstrated.
Detecting activities of daily living in first-person camera views
TLDR
This work presents a novel dataset and novel algorithms for the problem of detecting activities of daily living in first-person camera views, and develops novel representations, including temporal pyramids and composite object models, that exploit the fact that objects look different when being interacted with.
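A temporal pyramid can be sketched compactly: per-frame descriptors are pooled over the whole clip, then over halves, then quarters, and the segment averages are concatenated so the final descriptor keeps coarse temporal ordering. This NumPy sketch is illustrative and does not reproduce the paper's object-centric features.

```python
# Temporal pyramid pooling over per-frame descriptors.
import numpy as np

def temporal_pyramid(frame_feats, levels=3):
    """frame_feats: (T, D) array of per-frame descriptors."""
    T, _ = frame_feats.shape
    pooled = []
    for level in range(levels):                         # 1, 2, 4, ... segments
        for seg in np.array_split(np.arange(T), 2 ** level):
            pooled.append(frame_feats[seg].mean(axis=0))
    return np.concatenate(pooled)                       # (D * (2**levels - 1),)

desc = temporal_pyramid(np.random.rand(30, 128))
print(desc.shape)                                       # (896,) = 128 * (1 + 2 + 4)
```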