• Corpus ID: 220486849

AViD Dataset: Anonymized Videos from Diverse Countries

  title={AViD Dataset: Anonymized Videos from Diverse Countries},
  author={A. J. Piergiovanni and Michael S. Ryoo},
We introduce a new public video dataset for action recognition: Anonymized Videos from Diverse countries (AViD). Unlike existing public video datasets, AViD is a collection of action videos from many different countries. The motivation is to create a public dataset that would benefit training and pretraining of action recognition models for everybody, rather than making it useful for limited countries. Further, all the face identities in the AViD videos are properly anonymized to protect their… 

A Study of Face Obfuscation in ImageNet

The effects of face obfuscation on the popular ImageNet challenge visual recognition benchmark is explored, the feasibil-ity of privacy-aware visual recognition is demonstrated, the highly-used Image net challenge benchmark is improved, and an important path for future visual datasets is suggested.

A Comprehensive Study of Deep Video Action Recognition

A comprehensive survey of over 200 existing papers on deep learning for video action recognition is provided, starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models.

Video Action Understanding: A Tutorial

This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning model architectures.

Activity Graph Transformer for Temporal Action Localization

An end-to-end learnable model for temporal action localization, that receives a video as input and directly predicts a set of action instances that appear in the video, that outperforms the state-of-the-art by a considerable margin.

A Review of Deep Learning for Video Captioning

This survey covers deep learning-based VC, including but, not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, dense video captioning (DVC), and more.

A Comprehensive Review of Recent Deep Learning Techniques for Human Activity Recognition

This survey provides recent convolution-free-based methods which replaced convolution networks with the transformer networks that achieved state-of-the-art results on many human action recognition datasets.

Ethical Considerations for Collecting Human-Centric Image Datasets

The research directly addresses issues of privacy and bias by contributing to the research community best practices for ethical data collection, covering purpose, privacy and consent, as well as diversity.

Baseline Method for the Sport Task of MediaEval 2022 with 3D CNNs using Attention Mechanisms

This paper presents the baseline method proposed for the Sports Video task part of the MediaEval 2022 benchmark. This task proposes two subtasks: stroke classification from trimmed videos, and stroke

Fine-grained Activities of People Worldwide

Collect, a free mobile app to record video while simultaneously annotating objects and activities of consented subjects, and provides activity classification and activity detection benchmarks for this dataset, and analyzes baseline results to gain insight into how people around with world perform common activities.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D Conv net that is based on 2D ConvNet inflation is introduced.

C3D: Generic Features for Video Analysis

Convolution 3D feature is proposed, a generic spatio-temporal feature obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts that encapsulate appearance and motion cues and perform well on different video classification tasks.

YouTube-8M: A Large-Scale Video Classification Benchmark

YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities, and various (modest) classification models are trained on the dataset.

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

This work proposes a novel Hollywood in Homes approach to collect data, collecting a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities, and evaluates and provides baseline results for several tasks including action recognition and automatic description generation.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

HMDB: A large video database for human motion recognition

This paper uses the largest action video database to-date with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions.

HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

On HACS Segments, the state-of-the-art methods of action proposal generation and action localization are evaluated, and the new challenges posed by the dense temporal annotations are highlighted.

The Kinetics Human Action Video Dataset

The dataset is described, the statistics are described, how it was collected, and some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset are given.

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.