
AViD Dataset: Anonymized Videos from Diverse Countries

A. J. Piergiovanni and Michael S. Ryoo
We introduce a new public video dataset for action recognition: Anonymized Videos from Diverse Countries (AViD). Unlike existing public video datasets, AViD is a collection of action videos from many different countries. The motivation is to create a public dataset that benefits training and pretraining of action recognition models for everybody, rather than being useful only for limited countries. Further, all the face identities in the AViD videos are properly anonymized to protect their…
A Study of Face Obfuscation in ImageNet
This paper investigates how face blurring, a typical obfuscation technique, impacts classification accuracy, and shows that features learned on face-blurred images remain equally transferable to 4 downstream tasks.
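Face obfuscation of the kind studied above can be sketched as blurring only the detected face region while leaving the rest of the image intact. The following is a toy illustration (not the paper's actual pipeline); the function name, box convention, and kernel size are assumptions for demonstration, and a real system would use a face detector plus Gaussian blur:

```python
def blur_region(img, box, k=3):
    """Box-blur a rectangular region of a grayscale image (list of lists
    of ints). `box` = (top, left, bottom, right), bottom/right exclusive.
    A toy stand-in for face obfuscation: pixels outside the box are kept."""
    top, left, bottom, right = box
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    r = k // 2
    for y in range(top, bottom):
        for x in range(left, right):
            # Average over the k x k neighborhood, clipped at image borders.
            vals = [img[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = sum(vals) // len(vals)
    return out
```

Because only the boxed region is rewritten, the surrounding scene context (which the transfer-learning results suggest carries most of the signal) is preserved.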
A Comprehensive Study of Deep Video Action Recognition
A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models.
Video Action Understanding: A Tutorial
This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning model architectures.
Activity Graph Transformer for Temporal Action Localization
We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization, that receives a video as input and directly predicts a set of action instances that appear in…
2D progressive fusion module for action recognition
A 2D Progressive Fusion Module is proposed, inserted after the 2D backbone CNN layers; it fuses features through a novel 2D convolution over the spatial and temporal dimensions, called variation attenuating convolution, and applies fusion techniques to improve recognition accuracy and convergence.
Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus
This paper contributes a formal case study in retrospective dataset documentation and pinpoints several problems with the influential BookCorpus dataset. Recent work has underscored the importance of…
Estimating (and fixing) the Effect of Face Obfuscation in Video Recognition
This paper investigates the role of face obfuscation in video classification datasets and proposes a generalized distillation approach in which a privacy-preserving action recognition network is trained with privileged information given by face identities to close the performance gap caused by face anonymization.
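A generalized-distillation objective of the kind described above typically mixes a hard-label loss on the privacy-preserving student with a soft-target term toward a teacher that saw the privileged (non-anonymized) information. The sketch below shows one assumed generic form of such a loss; `alpha`, the temperature `T`, and the exact weighting are illustrative choices, not the paper's reported objective:

```python
import math

def distill_loss(student_logits, teacher_probs, label, alpha=0.5, T=2.0):
    """Generic distillation loss sketch: (1-alpha) * cross-entropy on the
    hard label + alpha * T^2 * KL(teacher || temperature-softened student).
    `teacher_probs` is the privileged teacher's class distribution."""
    def softmax(z, t=1.0):
        m = max(z)
        e = [math.exp((v - m) / t) for v in z]
        s = sum(e)
        return [v / s for v in e]
    p = softmax(student_logits)        # student posterior (T = 1)
    p_t = softmax(student_logits, T)   # temperature-softened student
    ce = -math.log(p[label])           # hard-label cross-entropy
    kl = sum(q * math.log(q / max(s, 1e-12))
             for q, s in zip(teacher_probs, p_t) if q > 0)
    return (1 - alpha) * ce + alpha * (T * T) * kl
```

The soft term lets the student recover identity-correlated cues the teacher learned, without the student ever seeing un-anonymized faces.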
RedCaps: Web-curated image-text data created by the people, for the people
Datasets of images and text have become increasingly popular for learning representations that generalize to visual recognition and vision-and-language tasks. Prior public datasets were built by…
TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
A novel visual representation learning approach that relies on a handful of adaptively learned tokens, is applicable to both image and video understanding tasks, and achieves competitive results at significantly reduced compute.
Video Action Understanding
This tutorial introduces and systematizes fundamental topics, basic concepts, and notable examples in supervised video action understanding; it clarifies a taxonomy of action problems, catalogs and highlights video datasets, and formalizes domain-specific metrics to baseline proposed solutions.


Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D ConvNet based on 2D ConvNet inflation is introduced.
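The "inflation" behind I3D bootstraps a 3D filter from a pretrained 2D one: the 2D kernel is repeated T times along the temporal axis and rescaled by 1/T, so that on a boring (frame-repeated) video the 3D filter's response matches the original 2D filter's. A minimal sketch of that idea, using plain nested lists rather than a deep-learning framework:

```python
def inflate_2d_kernel(k2d, t):
    """Inflate a 2D conv kernel (H x W list of lists of floats) into a
    3D kernel (T x H x W) by repeating it across time and dividing by T.
    Summed over time, each spatial weight equals the original 2D weight,
    preserving activations on a temporally constant input."""
    return [[[v / t for v in row] for row in k2d] for _ in range(t)]
```

In practice the same transformation is applied to every conv layer of a 2D network pretrained on ImageNet, giving the 3D network a strong initialization before Kinetics training.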
C3D: Generic Features for Video Analysis
Convolution 3D (C3D) features are proposed: generic spatio-temporal features obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts; they encapsulate appearance and motion cues and perform well on different video classification tasks.
YouTube-8M: A Large-Scale Video Classification Benchmark
YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities, and various (modest) classification models are trained on the dataset.
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
This work proposes a novel Hollywood in Homes approach to collect data, collecting a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities, and evaluates and provides baseline results for several tasks including action recognition and automatic description generation.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
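The simplest way the two streams above are combined at test time is late fusion: average (or weight) the class scores from the spatial (RGB) stream and the temporal (optical-flow) stream, then take the argmax. A minimal sketch; the equal weight `w=0.5` is an illustrative default:

```python
def fuse_two_stream(rgb_scores, flow_scores, w=0.5):
    """Late fusion of per-class scores from the spatial (RGB) and
    temporal (optical-flow) streams by weighted averaging, returning
    the index of the predicted class."""
    fused = [w * a + (1 - w) * b for a, b in zip(rgb_scores, flow_scores)]
    return max(range(len(fused)), key=fused.__getitem__)
```

Because fusion happens on scores rather than features, each stream can be trained independently, which is what makes the architecture practical with limited video training data.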
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
HMDB: A large video database for human motion recognition
This paper uses the largest action video database to date, with 51 action categories containing around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, to evaluate the performance of two representative computer vision systems for action recognition and to explore the robustness of these methods under various conditions.
HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization
  • Hang Zhao, Zhicheng Yan, Heng Wang, Lorenzo Torresani
  • Computer Science
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
On HACS Segments, the state-of-the-art methods of action proposal generation and action localization are evaluated, and the new challenges posed by the dense temporal annotations are highlighted.
The Kinetics Human Action Video Dataset
The dataset, its statistics, and how it was collected are described, and baseline performance figures are given for neural network architectures trained and tested for human action classification on this dataset.
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
  • C. Gu, Chen Sun, +8 authors J. Malik
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.