Corpus ID: 11241677

YouTube-8M: A Large-Scale Video Classification Benchmark

@article{AbuElHaija2016YouTube8MAL,
  title={YouTube-8M: A Large-Scale Video Classification Benchmark},
  author={Sami Abu-El-Haija and Nisarg Kothari and Joonseok Lee and Apostol Natsev and George Toderici and Balakrishnan Varadarajan and Sudheendra Vijayanarasimhan},
  journal={ArXiv},
  year={2016},
  volume={abs/1609.08675}
}
Many recent advancements in Computer Vision are attributed to large datasets. […] We then decoded each video at one frame per second and used a deep CNN pre-trained on ImageNet to extract the hidden representation immediately prior to the classification layer. Finally, we compressed the frame features and made both the features and the video-level labels available for download. We trained various (modest) classification models on the dataset, evaluated them using popular evaluation metrics, and…
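The extract-then-compress pipeline the abstract describes can be sketched in a few lines. This is a minimal illustration, not the release code: `stub_cnn` is a hypothetical stand-in for the penultimate layer of the ImageNet-pretrained CNN, and the compression step is shown as simple uniform 8-bit quantization (the actual YouTube-8M release applies PCA before quantizing).

```python
import numpy as np

def extract_frame_features(frames, extractor):
    """Map each decoded frame (H, W, 3) to a fixed-length feature vector.

    `extractor` stands in for the hidden layer immediately prior to the
    classification layer of a pre-trained CNN; here it is any callable
    mapping one frame to one vector.
    """
    return np.stack([extractor(f) for f in frames])

def quantize(features, n_bits=8):
    """Uniformly quantize features to n_bits integers (illustrative only;
    the real release uses PCA followed by quantization)."""
    lo, hi = features.min(), features.max()
    scale = (2 ** n_bits - 1) / (hi - lo + 1e-12)
    codes = np.round((features - lo) * scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate float features from the integer codes."""
    return codes.astype(np.float32) / scale + lo

# Stub "CNN": a fixed random projection of mean-pooled pixels (hypothetical).
rng = np.random.default_rng(0)
proj = rng.standard_normal((3, 1024)).astype(np.float32)
stub_cnn = lambda frame: frame.mean(axis=(0, 1)) @ proj

# One decoded frame per second for a 30-second clip.
frames = rng.random((30, 224, 224, 3)).astype(np.float32)
feats = extract_frame_features(frames, stub_cnn)   # shape (30, 1024)
codes, lo, scale = quantize(feats)                 # uint8, far smaller on disk
recon = dequantize(codes, lo, scale)               # close to feats
```

The design choice this illustrates: storing one compact vector per second of video, rather than raw frames, is what makes distributing millions of videos' worth of features tractable.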


Deep Learning YouTube Video Tags
TLDR
This project seeks to apply state-of-the-art deep learning methods to the problem of automatically labelling video frame data, using a hybrid CNN-RNN architecture that takes the image features generated for each video and combines them with an LSTM model run over the word embeddings of the label set to generate label predictions that take label correlation and dependency into account.
An Effective Way to Improve YouTube-8M Classification Accuracy in Google Cloud Platform
TLDR
This work built several baseline predictions according to the benchmark paper and the public GitHub TensorFlow code, and improved global average precision (GAP) from a base level of 77% to 80.7% through ensemble approaches.
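GAP, the metric this TLDR reports, pools every (score, label) pair across all videos and classes into one ranked list and computes average precision over it. A minimal sketch is below; note the actual YouTube-8M Kaggle evaluation additionally restricts each video to its top-20 predictions before pooling, which this sketch omits.

```python
import numpy as np

def global_average_precision(scores, labels):
    """Global average precision over all (score, label) pairs.

    scores: array of prediction confidences, any shape.
    labels: binary ground-truth array of the same shape.
    """
    order = np.argsort(-scores.ravel())            # rank all pairs by confidence
    rel = labels.ravel()[order].astype(np.float64)  # 1 where a positive was ranked
    if rel.sum() == 0:
        return 0.0
    precision = np.cumsum(rel) / np.arange(1, rel.size + 1)
    # Average of precision values at the positions of the positives.
    return float((precision * rel).sum() / rel.sum())
```

For example, a perfect ranking scores 1.0, and ranking one negative above a positive pulls the value below 1.0 in proportion to how early the mistake occurs.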
Large scale video classification using both visual and audio features on YouTube-8M dataset
TLDR
This paper explores several models combining video-level visual and audio features in different ways, providing a promising classifier for the YouTube-8M Kaggle challenge, a video classification task over a dataset of 7 million YouTube videos belonging to 4,716 classes.
Large-Scale Training Framework for Video Annotation
TLDR
A MapReduce-based training framework is presented that exploits both data parallelism and model parallelism to scale training of complex video models; it supports large Mixture-of-Experts classifiers with hundreds of thousands of mixtures, enabling a trade-off between model depth and breadth.
YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
TLDR
A new large-scale data set of video URLs with densely-sampled object bounding box annotations, called YouTube-BoundingBoxes (YT-BB), is introduced; it consists of approximately 380,000 video segments automatically selected to feature objects in natural settings without editing or post-processing.
YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark
TLDR
A new large-scale video object segmentation dataset, the YouTube Video Object Segmentation dataset (YouTube-VOS), is built, aiming to establish baselines for the development of new algorithms in the future.
Low-Complexity Video Classification using Recurrent Neural Networks
TLDR
This paper trains several deep neural networks for video classification on a subset of YouTube-8M, extracting frame-level features with the Inception-v3 network and feeding them to recurrent neural networks with LSTM/BiLSTM units for classification.
NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels
TLDR
This work creates a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information, and utilizes the collected dataset for action classification and demonstrates its usefulness with existing small-scale annotated datasets, UCF101 and HMDB51.
Indexed Dataset from YouTube for a Content-Based Video Search Engine
TLDR
This paper introduces a novel large-scale dataset based on YouTube video links to evaluate the proposed content-based video search engine; it gathers 1,088 videos representing more than 65 hours of footage, 11,000 video shots, 66,000 marked and unmarked keyframes, and 80 different object names used for marking.
Large-Scale YouTube-8M Video Understanding with Deep Neural Networks
TLDR
Three models are provided to address video classification on the recently announced YouTube-8M large-scale dataset, based on a frame-pooling approach, LSTM networks, and a Mixture-of-Experts intermediate layer, allowing model capacity to be increased without dramatically increasing computation.

References

Showing 1–10 of 45 references
C3D: Generic Features for Video Analysis
TLDR
The Convolution 3D (C3D) feature is proposed: a generic spatio-temporal feature obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts; C3D features encapsulate appearance and motion cues and perform well on different video classification tasks.
HMDB: A large video database for human motion recognition
TLDR
This paper uses the largest action video database to-date with 51 action categories, which in total contain around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, to evaluate the performance of two representative computer vision systems for action recognition and explore the robustness of these methods under various conditions.
Large-Scale Video Classification with Convolutional Neural Networks
TLDR
This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Beyond short snippets: Deep networks for video classification
TLDR
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Learning realistic human actions from movies
TLDR
A new method for video classification is presented that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, and is shown to improve state-of-the-art results on the standard KTH action dataset.
A discriminative CNN video representation for event detection
TLDR
This paper proposes using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable, resulting in new state-of-the-art performance in event detection over the largest video datasets.
The New Data and New Challenges in Multimedia Research
TLDR
The rationale behind the creation of the YFCC100M, the largest public multimedia collection that has ever been released, is explained, as well as the implications the dataset has for science, research, engineering, and development.
Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks
TLDR
By arming the DNN with a better capability of harnessing both the feature and the class relationships, the proposed regularized DNN (rDNN) is more suitable for modeling video semantics.
ActivityNet: A large-scale video benchmark for human activity understanding
TLDR
This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.