Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning

@article{Wang2021RemovingTB,
  title={Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning},
  author={Jinpeng Wang and Yuting Gao and Ke Li and Yiqi Lin and Andy Jinhua Ma and Xing Sun},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={11799-11808}
}
  • Jinpeng Wang, Yuting Gao, Ke Li, Yiqi Lin, Andy Jinhua Ma, Xing Sun
  • Published 12 September 2020
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Self-supervised learning has shown great potential in improving the video representation ability of deep neural networks by getting supervision from the data itself. However, some current methods tend to cheat from the background, i.e., the prediction is highly dependent on the video background instead of the motion, making the model vulnerable to background changes. To mitigate the model's reliance on the background, we propose to remove the background impact by adding the…
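
The truncated sentence points at the paper's core operation: constructing a distracting version of a clip by adding a static frame to every frame, so that background cues stop being discriminative. Below is a minimal sketch of one plausible form of this augmentation in Python/NumPy; the helper name, the blending weight lam, and its default value are assumptions for illustration, not the paper's exact scheme.

import numpy as np

def add_background_distractor(clip, lam=0.3, rng=np.random):
    # clip: float array of shape (T, H, W, C), values in [0, 1].
    # lam: assumed blending weight for the static frame (illustrative).
    static = clip[rng.randint(len(clip))]        # random static frame, (H, W, C)
    # Blend the same static frame into every frame: per-frame motion is
    # preserved, while the injected background is constant across time.
    distracted = (1.0 - lam) * clip + lam * static[None, ...]
    return distracted.clip(0.0, 1.0)

The title suggests the model is then trained to produce similar features for the original and distracted clips, so that the representation depends on motion rather than on background appearance.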

Citations

Video Representation Learning with Graph Contrastive Augmentation
TLDR: This work proposes a novel contrastive self-supervised video representation learning framework, termed Graph Contrastive Augmentation (GCA), which constructs a video temporal graph and devises a graph augmentation designed to enhance the correlation across frames, providing a new view for exploring the temporal structure of videos.
ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency
TLDR: It is observed that consistency between positive samples is the key to learning robust video representations, and two tasks are proposed to learn appearance consistency and speed consistency, respectively.
Contrast and Mix: Temporal Contrastive Video Domain Adaptation with Background Mixing
TLDR: This paper introduces Contrast and Mix (CoMix), a new contrastive learning framework that aims to learn discriminative invariant feature representations for unsupervised video domain adaptation, and proposes a novel extension to the temporal contrastive loss.
DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning
TLDR: This work proposes to distill the final embedding to maximally transmit a teacher’s knowledge to a lightweight model by constraining the last embedding of the student to be consistent with that of the teacher, achieving state-of-the-art results on all lightweight models.
Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization
TLDR: A multi-level feature optimization framework is proposed to improve the generalization and temporal modeling ability of learned video representations, together with a simple temporal modeling module built from multi-level features to enhance motion pattern learning.
How Incomplete is Contrastive Learning? An Inter-intra Variant Dual Representation Method for Self-supervised Video Recognition
  • Lin Zhang, Qi She, Zhengyang Shen, Changhu Wang
  • Computer Science
  • ArXiv
  • 2021
TLDR: This paper proposes to learn dual representations for each clip, encoding intra-variance through a shuffle-rank pretext task and inter-variance through a temporally coherent contrastive loss, and shows that this method plays an essential role in balancing inter- and intra-variances.
Inter-intra Variant Dual Representations for Self-supervised Video Recognition
  • Lin Zhang, Qi She, Zhengyang Shen, Changhu Wang
  • Computer Science
  • 2021
TLDR: This paper finds that existing contrastive learning based solutions for self-supervised video recognition focus on inter-variance encoding but ignore the intra-variance present in clips of the same video, and proposes to learn dual representations for each clip to balance inter- and intra-variances.
Learning Spatio-temporal Representation by Channel Aliasing Video Perception
TLDR: This paper proposes a novel pretext task, Channel Aliasing Video Perception (CAVP), for self-supervised video representation learning: recognizing the number of different motion flows within a channel-aliasing video to perceive discriminative motion cues.
Motion-aware Self-supervised Video Representation Learning via Foreground-background Merging
  • Shuangrui Ding, Maomao Li, +4 authors Jue Wang
  • Computer Science
  • ArXiv
  • 2021
TLDR: Foreground-background Merging (FAME) is proposed to deliberately compose the foreground region of one video onto the background of others; the model thereby focuses more on the foreground motion pattern and is more robust to the background context (a compositing sketch follows this list).
Source-free unsupervised multi-source domain adaptation via proxy task for person re-identification
  • Yi Ding, Zhikui Duan, Shiren Li
  • Computer Science
  • The Visual Computer
  • 2021
TLDR: A novel method for source-free (without accessing any source domain data) multi-source domain adaptation in person re-identification (Re-ID), in which two proxy tasks are properly aggregated and adaptively transferred to the target domain without any source data.
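
The FAME entry above refers to a compositing sketch; its core step is pasting the foreground of one clip onto the background of another. Here is a minimal sketch assuming a precomputed binary foreground mask; how FAME actually extracts masks without extra supervision is abstracted away, and the function name is illustrative:

import numpy as np

def foreground_background_merge(fg_clip, bg_clip, fg_mask):
    # fg_clip, bg_clip: float arrays of shape (T, H, W, C).
    # fg_mask: {0, 1} array of shape (T, H, W, 1); 1 marks foreground.
    # The composed clip keeps the moving foreground of one video while
    # replacing its background with frames from another video.
    return fg_mask * fg_clip + (1.0 - fg_mask) * bg_clip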

References

Showing 1-10 of 70 references
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
TLDR: The proposed Inter-Intra Contrastive (IIC) framework can train spatio-temporal convolutional networks to learn video representations, and it outperforms current state-of-the-art results by a large margin.
Self-Supervised Spatiotemporal Feature Learning via Video Rotation Prediction
TLDR: With the self-supervised 3DRotNet pre-trained on large datasets, recognition accuracy is boosted by 20.4% on UCF101 and 16.7% on HMDB51, respectively, compared to models trained from scratch.
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning
TLDR: Geometry, a brand-new type of auxiliary supervision for the self-supervised learning of video representations, is explored, and it is found that convolutional neural networks pre-trained with geometry cues can be effectively adapted to semantic video understanding tasks.
Self-supervised Video Representation Learning by Pace Prediction
TLDR: This paper addresses self-supervised video representation learning from a new perspective, video pace prediction, and introduces contrastive learning to push the model towards discriminating different paces by maximizing agreement on similar video content.
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
TLDR: A self-supervised spatiotemporal learning technique that leverages the chronological order of videos, learning the spatiotemporal representation of a video by predicting the order of shuffled clips.
DynamoNet: Dynamic Action and Motion Network
TLDR: A novel unified spatio-temporal 3D-CNN architecture (DynamoNet) is introduced that jointly optimizes video classification and motion representation learning by predicting future frames as a multi-task learning problem.
Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
TLDR: This paper proposes to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data, and shows that the approach can significantly improve the performance of C3D when applied to video classification tasks.
Generating Videos with Scene Dynamics
TLDR: A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene’s foreground from the background is proposed; it can generate tiny videos up to a second long at full frame rate, better than simple baselines.
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features
TLDR: Patches are cut and pasted among training images, with the ground-truth labels mixed proportionally to the area of the patches; CutMix consistently outperforms state-of-the-art augmentation strategies on CIFAR and ImageNet classification, as well as on the ImageNet weakly-supervised localization task (a sketch follows this reference list).
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
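
The CutMix reference above compresses the algorithm into one sentence; the following sketch spells it out for a single pair of labeled images. The Beta-distributed area ratio follows CutMix's standard formulation; the helper name and its signature are illustrative:

import numpy as np

def cutmix(img_a, lab_a, img_b, lab_b, alpha=1.0, rng=np.random):
    # img_*: (H, W, C) float arrays; lab_*: one-hot label vectors.
    h, w = img_a.shape[:2]
    lam = rng.beta(alpha, alpha)              # target kept-area ratio
    cut_h = int(h * np.sqrt(1.0 - lam))       # patch sized so its area
    cut_w = int(w * np.sqrt(1.0 - lam))       # is roughly (1 - lam) of the image
    cy, cx = rng.randint(h), rng.randint(w)   # random patch centre
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = img_a.copy()
    mixed[top:bottom, left:right] = img_b[top:bottom, left:right]
    # Mix labels by the area actually kept after clipping at the borders.
    lam_eff = 1.0 - (bottom - top) * (right - left) / (h * w)
    return mixed, lam_eff * lab_a + (1.0 - lam_eff) * lab_b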