VideoSSL: Semi-Supervised Learning for Video Classification

@inproceedings{jing2021videossl,
  title={VideoSSL: Semi-Supervised Learning for Video Classification},
  author={Longlong Jing and Toufiq Parag and Zhe Wu and Yingli Tian and Hongcheng Wang},
  booktitle={2021 IEEE Winter Conference on Applications of Computer Vision (WACV)},
  year={2021}
}
We propose a semi-supervised learning approach for video classification, VideoSSL, using convolutional neural networks (CNNs). Like other computer vision tasks, existing supervised video classification methods demand a large amount of labeled data to attain good performance. However, annotating a large dataset is expensive and time-consuming. To minimize the dependence on a large annotated dataset, our proposed semi-supervised method trains from a small number of labeled examples and exploits…
Learning from Temporal Gradient for Semi-supervised Action Recognition
This paper introduces temporal gradient as an additional modality for more attentive feature extraction in semi-supervised video action recognition and explicitly distills the fine-grained motion representations from temporal gradient (TG) and imposes consistency across different modalities.
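The temporal gradient (TG) modality referenced above is commonly computed as the frame-to-frame pixel difference of a clip. A minimal sketch, with frames represented as flat lists of pixel intensities purely for illustration:

```python
# Minimal sketch: temporal gradient (TG) as the pixel-wise difference of
# consecutive video frames. Real pipelines operate on tensors of shape
# (T, H, W, C); flat lists are used here only to keep the example small.

def temporal_gradient(frames):
    """Return TG[t] = frames[t+1] - frames[t] for each consecutive pair."""
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]

clip = [
    [0, 0, 0, 0],   # frame 0
    [1, 0, 2, 0],   # frame 1
    [1, 3, 2, 0],   # frame 2
]
tg = temporal_gradient(clip)
# TG has one fewer frame than the clip and is nonzero only where
# pixels change, which is why it highlights motion.
```

Because static background cancels out in the subtraction, TG acts as a cheap motion cue alongside RGB.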
Semi-supervised Learning Combining 2DCNNs and Video Compression for Action Recognition
The approach uses 2D CNNs because compressed video representations already encode temporal information; in experiments, it reduces over-fitting and boosts the performance of semi-supervised action recognition.
Multiview Pseudo-Labeling for Semi-supervised Learning from Video
A multiview pseudo-labeling approach to video learning: a novel framework that uses complementary views, in the form of appearance and motion information, for semi-supervised learning in video, learning stronger video representations than training on purely supervised data.
Motion-Augmented Self-Training for Video Recognition at Smaller Scale
The first motion-augmented self-training regime for deploying 3D convolutional neural networks on an unlabeled video collection, which outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7%, and semi-supervised learning by 9%-18% using the same amount of class labels.
Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition
  • Yinghao Xu, Fangyun Wei, et al. (incl. Stephen Lin), 2021
This work proposes a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL), which introduces a lightweight auxiliary network in addition to the primary backbone and asks the two models to predict pseudo-labels for each other, observing that the two models tend to learn complementary representations from the same video clips.
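The core exchange in cross-model pseudo-labeling can be sketched as follows; the function names and the 0.8 confidence threshold are illustrative stand-ins, not taken from the paper:

```python
# Hedged sketch of cross-model pseudo-labeling: two models score the same
# unlabeled clip, and each model is trained on the *other* model's
# confident prediction. Probability vectors stand in for model outputs.

def pseudo_label(probs, threshold=0.8):
    """Return the argmax class if the model is confident enough, else None."""
    best = max(range(len(probs)), key=lambda c: probs[c])
    return best if probs[best] >= threshold else None

def cross_model_targets(probs_a, probs_b, threshold=0.8):
    """Each model receives the other's pseudo-label as its training target."""
    return {
        "target_for_a": pseudo_label(probs_b, threshold),
        "target_for_b": pseudo_label(probs_a, threshold),
    }

# Primary backbone is confident about class 2; auxiliary model is unsure.
primary = [0.05, 0.05, 0.90]
auxiliary = [0.40, 0.35, 0.25]
targets = cross_model_targets(primary, auxiliary)
# Only the confident model contributes a target, so noisy predictions
# from the uncertain model are filtered out of training.
```

Swapping targets between two differently structured models is what lets their complementary errors regularize each other.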
Species Classification in Thermal Imaging Videos
We examine a labelled dataset of thermal imaging recordings of animals in New Zealand bush, for the purpose of developing models which can automate the labelling of species. After approaching the…
Learning Representational Invariances for Data-Efficient Action Recognition
This paper investigates various data augmentation strategies that capture different video invariances, including photometric, geometric, temporal, and actor/scene augmentations, and shows that this strategy leads to promising performance on the Kinetics-100, UCF-101, and HMDB-51 datasets in the low-label regime.
In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning
This work proposes an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo-labeling accuracy by drastically reducing the amount of noise encountered in the training process, and generalizes the pseudo-labeling process by allowing for the creation of negative pseudo-labels.
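The selection step described above can be sketched as a pair of filters; all threshold values below are illustrative assumptions, not the paper's settings:

```python
# Hedged sketch of uncertainty-aware pseudo-label selection (UPS-style):
# keep a positive pseudo-label only when the prediction is both confident
# and low-uncertainty, and emit negative pseudo-labels ("not class c")
# for classes assigned very low probability.

def select_pseudo_labels(probs, uncertainty,
                         pos_conf=0.9, max_unc=0.05, neg_conf=0.05):
    """Return (positive_label_or_None, list_of_negative_labels)."""
    best = max(range(len(probs)), key=lambda c: probs[c])
    confident = probs[best] >= pos_conf and uncertainty <= max_unc
    positive = best if confident else None
    negatives = [c for c, p in enumerate(probs) if p <= neg_conf and c != best]
    return positive, negatives

# Confident, low-uncertainty prediction: class 0 is kept as a positive
# label, while class 2 becomes a negative label ("this clip is not class 2").
pos, neg = select_pseudo_labels([0.92, 0.06, 0.02], uncertainty=0.01)
```

Negative labels are useful precisely because ruling a class out is often far more reliable than ruling one in.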
Filtering Videos by Classifying Them with Deep Learning (Videoların Derin Öğrenme ile Sınıflandırılarak Filtrelenmesi)
Semi-Supervised Learning for Sparsely-Labeled Sequential Data: Application to Healthcare Video Processing
A semisupervised machine learning training strategy to improve event detection performance on sequential data, such as video recordings, when only sparse labels are available, and it is shown that neural networks can improve their detection performance by leveraging more training data with less conservative approximations despite the higher proportion of incorrect labels.


DistInit: Learning Video Representations Without a Single Labeled Video
This work proposes an alternative approach to learning video representations that requires no semantically labeled videos, and instead leverages the years of effort in collecting and labeling large and clean still-image datasets, and obtains strong transfer performance.
Large-Scale Video Classification with Convolutional Neural Networks
This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
An extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos as a subset of unsupervised learning methods to learn general image and video features from large-scale unlabeled data without using any human-annotated labels is provided.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
This paper forms an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning
An unsupervised loss function is proposed that takes advantage of the stochastic nature of these methods and minimizes the difference between the predictions of multiple passes of a training sample through the network.
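The unsupervised loss described above reduces to penalizing disagreement between predictions from two stochastic passes over the same sample. A minimal sketch of that squared-difference form, with illustrative numbers:

```python
# Hedged sketch of a stability/consistency loss: the same sample is passed
# through the network twice under different random transformations or
# dropout masks, and the loss is the squared difference between the two
# prediction vectors. No network is modeled here, only the loss itself.

def stability_loss(pred_pass1, pred_pass2):
    """Mean squared difference between two prediction vectors."""
    assert len(pred_pass1) == len(pred_pass2)
    return sum((a - b) ** 2 for a, b in zip(pred_pass1, pred_pass2)) / len(pred_pass1)

# Two passes of the same clip under different random perturbations.
pass1 = [0.7, 0.2, 0.1]
pass2 = [0.6, 0.3, 0.1]
loss = stability_loss(pass1, pass2)
# Identical predictions incur zero loss, so minimizing this term pushes
# the network toward perturbation-invariant outputs.
```

Because the loss needs no labels, it can be computed on every unlabeled sample and added to the supervised objective.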
Semi-supervised convolutional neural networks for human activity recognition
This paper presents two semi-supervised methods based on convolutional neural networks (CNNs) to learn discriminative hidden features and shows that their CNNs outperform supervised methods and traditional semi-supervised learning methods by up to 18% in mean F1-score (Fm).
Billion-scale semi-supervised learning for image classification
This paper proposes a pipeline, based on a teacher/student paradigm, that leverages a large collection of unlabelled images to improve the performance for a given target architecture, like ResNet-50 or ResNeXt.
Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition
The primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite on noisy social-media videos and hashtags, substantially improves the state-of-the-art on three challenging public action recognition datasets.
Beyond short snippets: Deep networks for video classification
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
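The two-stream combination can be sketched as late fusion of the per-class scores from the spatial (RGB) and temporal (optical-flow) networks; averaging is one common fusion choice, and the score vectors below are illustrative stand-ins for softmax outputs:

```python
# Hedged sketch of two-stream late fusion: class scores from the spatial
# stream and the temporal (optical-flow) stream are averaged, and the
# final prediction is the argmax of the fused scores.

def fuse_streams(spatial_scores, flow_scores):
    """Average the per-class scores of the two streams."""
    return [(s + f) / 2 for s, f in zip(spatial_scores, flow_scores)]

spatial = [0.6, 0.3, 0.1]   # appearance cues favor class 0
flow = [0.2, 0.7, 0.1]      # motion cues favor class 1
fused = fuse_streams(spatial, flow)
predicted = max(range(len(fused)), key=lambda c: fused[c])
# Here the stronger motion evidence tips the fused decision toward class 1.
```

Keeping the streams separate until fusion is what lets the flow network learn motion patterns even when RGB training data is limited.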