Corpus ID: 222341585

Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning

@article{Yang2020BackTT,
  title={Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning},
  author={Xinyu Yang and Majid Mirmehdi and Tilo Burghardt},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.07217}
}
In this paper we show that learning video feature spaces in which temporal cycles are maximally predictable benefits action classification. In particular, we propose a novel learning approach termed Cycle Encoding Prediction (CEP) that is able to effectively represent high-level spatio-temporal structure of unlabelled video content. CEP builds a latent space wherein the concept of closed forward-backward as well as backward-forward temporal loops is approximately preserved. As a self… 
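The abstract only sketches the mechanism at a high level. Purely as an illustration of what a forward-backward cycle prediction objective with a contrastive loss could look like, here is a minimal sketch; every name in it (CycleEncoderSketch, predict_fwd, predict_bwd, info_nce, the feature dimension and temperature) is an assumption for illustration, not taken from the paper's actual implementation.

```python
# Minimal, illustrative sketch of a forward-backward cycle prediction
# objective with a contrastive loss. All names and design choices here are
# hypothetical and are NOT taken from the CEP paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CycleEncoderSketch(nn.Module):
    def __init__(self, backbone, feat_dim=256):
        super().__init__()
        self.backbone = backbone                           # assumed 3D-CNN clip encoder -> (B, feat_dim)
        self.predict_fwd = nn.Linear(feat_dim, feat_dim)   # predicts the next clip's embedding
        self.predict_bwd = nn.Linear(feat_dim, feat_dim)   # predicts the previous clip's embedding

    def forward(self, clip_t, clip_t1):
        # Encode two temporally ordered clips from the same video.
        z_t = F.normalize(self.backbone(clip_t), dim=1)
        z_t1 = F.normalize(self.backbone(clip_t1), dim=1)
        # Forward hop t -> t+1, then backward hop from the prediction back to t,
        # closing the forward-backward temporal loop.
        z_t1_hat = F.normalize(self.predict_fwd(z_t), dim=1)
        z_t_cycled = F.normalize(self.predict_bwd(z_t1_hat), dim=1)
        return z_t, z_t1, z_t1_hat, z_t_cycled

def info_nce(query, positive, temperature=0.07):
    # InfoNCE with in-batch negatives: row i of `query` should match row i of `positive`.
    logits = query @ positive.t() / temperature
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)

# Hypothetical training objective: the predicted future embedding should match
# the true future embedding, and the closed cycle should return to the start.
#   z_t, z_t1, z_t1_hat, z_t_cycled = model(clip_t, clip_t1)
#   loss = info_nce(z_t1_hat, z_t1) + info_nce(z_t_cycled, z_t)
```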

Citations

Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity
TLDR
This work formulates three novel continuity-related pretext tasks, i.e. continuity justification, discontinuity localization, and missing section approximation, which jointly supervise a shared backbone for video representation learning and encourage the backbone network to learn local and long-range motion and context representations.
Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation
TLDR
This work proposes a novel approach to complement self-supervised pretraining via an auxiliary pretraining phase based on knowledge similarity distillation, auxSKD, for better generalisation with a significantly smaller amount of video data, e.g. Kinetics-100 rather than Kinetics-400.
Unsupervised Visual Representation Learning by Tracking Patches in Video
TLDR
A Catch-the-Patch (CtP) game is proposed for a 3D-CNN model to learn visual representations that help with video-related tasks; surprisingly, the CtP-pretrained representation achieves much higher action classification accuracy than its fully supervised counterpart on the Something-Something dataset.

References

SHOWING 1-10 OF 60 REFERENCES
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
TLDR
This work introduces UCF101, which is currently the largest dataset of human actions, and provides baseline action recognition results on this new dataset using a standard bag-of-words approach, with an overall performance of 44.5%.
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning
TLDR
A new way to perceive playback speed is proposed, exploiting the relative speed between two video clips as labels, to provide more effective and stable supervision for representation learning and to ensure the learning of appearance features.
HMDB: A large video database for human motion recognition
TLDR
This paper uses the largest action video database to date, with 51 action categories containing in total around 7,000 manually annotated clips extracted from a variety of sources ranging from digitized movies to YouTube, to evaluate the performance of two representative computer vision systems for action recognition and to explore the robustness of these methods under various conditions.
Memory-augmented Dense Predictive Coding for Video Representation Learning
TLDR
A new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), is proposed for self-supervised learning from video, in particular for action recognition representations, trained with a predictive attention mechanism over a set of compressed memories.
SpeedNet: Learning the Speediness in Videos
TLDR
This work applies SpeedNet, a novel deep network trained to detect whether a video is playing at normal rate or is sped up, to generating time-varying, adaptive video speedups, which allow viewers to watch videos faster but with less of the jittery, unnatural motion typical of videos that are sped up uniformly.
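As a rough sketch of the speediness pretext task this summary describes (sample a clip either at normal rate or temporally subsampled, then train a binary classifier on which it was), the snippet below is illustrative only; the function names, sampling scheme, and 2x speedup factor are assumptions, not the SpeedNet authors' code.

```python
# Rough sketch of a "speediness" pretext task: classify whether a clip is
# played at normal rate or sped up. Function names, the 2x speedup factor,
# and the sampling scheme are assumptions, not the SpeedNet implementation.
import random
import torch
import torch.nn.functional as F

def sample_clip(video, num_frames=16, stride=1):
    # video: tensor of shape (T, C, H, W); take num_frames frames at the given stride.
    span = num_frames * stride
    start = random.randint(0, video.size(0) - span)
    return video[start:start + span:stride]

def speediness_step(model, video):
    # Label 0 = normal rate, label 1 = sped up (frames subsampled 2x).
    label = random.randint(0, 1)
    clip = sample_clip(video, stride=2 if label == 1 else 1)
    logits = model(clip.unsqueeze(0))   # model: any clip classifier with 2 outputs (assumed)
    return F.cross_entropy(logits, torch.tensor([label]))
```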
Evolving Losses for Unsupervised Video Representation Learning
TLDR
An unsupervised representation evaluation metric is proposed using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law, which produces similar results to weakly-supervised, task-specific ones.
Momentum Contrast for Unsupervised Visual Representation Learning
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder.
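Since this excerpt stops at the core ingredients (a queue of keys and a momentum-updated key encoder), here is a heavily simplified sketch of that mechanism, assuming in-batch queries, a fixed-size FIFO queue of negatives, and an exponential-moving-average encoder update; it illustrates the general idea and is not the official MoCo implementation.

```python
# Heavily simplified sketch of momentum contrast: a FIFO queue of negative
# keys and a key encoder updated as a moving average of the query encoder.
# Illustration of the general idea only; not the official MoCo code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # Key encoder parameters track the query encoder via an exponential moving average.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

def moco_step(encoder_q, encoder_k, queue, x_q, x_k, temperature=0.07):
    # queue: (K, D) tensor of past keys used as negatives.
    q = F.normalize(encoder_q(x_q), dim=1)            # queries (B, D)
    with torch.no_grad():
        k = F.normalize(encoder_k(x_k), dim=1)        # positive keys (B, D)

    l_pos = (q * k).sum(dim=1, keepdim=True)          # (B, 1) positive logits
    l_neg = q @ queue.t()                             # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)

    # Enqueue the new keys and drop the oldest ones, keeping the queue size fixed.
    queue = torch.cat([k.detach(), queue], dim=0)[: queue.size(0)]
    return loss, queue
```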
Learning Video Representations using Contrastive Bidirectional Transformer
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and
Video Representation Learning by Dense Predictive Coding
TLDR
With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.
A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time.