Cross Pixel Optical Flow Similarity for Self-Supervised Learning

@inproceedings{Mahendran2018CrossPO,
  title={Cross Pixel Optical Flow Similarity for Self-Supervised Learning},
  author={Aravindh Mahendran and James Thewlis and Andrea Vedaldi},
  booktitle={ACCV},
  year={2018}
}
We propose a novel method for learning convolutional neural image representations without manual supervision. We use motion cues, in the form of optical flow, to supervise representations of static images. The obvious approach of training a network to predict flow from a single image can be needlessly difficult due to intrinsic ambiguities in this prediction task. We instead propose a much simpler learning goal: embed pixels such that the similarity between their embeddings matches that between… 
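The abstract's matching objective can be illustrated with a minimal NumPy sketch. The function name, the softmax-kernel choice, and the cross-entropy formulation below are illustrative assumptions, not the paper's exact loss; the idea is only that the pairwise-similarity structure of pixel embeddings is pushed toward the pairwise-similarity structure of their flow vectors:

```python
import numpy as np

def cross_pixel_similarity_loss(embeddings, flows, temperature=1.0):
    """Toy sketch: make the similarity kernel over pixel embeddings
    match the similarity kernel over their optical-flow vectors.

    embeddings: (N, D) array, one D-dim embedding per pixel
    flows:      (N, 2) array, one 2-d flow vector per pixel
    """
    def row_softmax_sim(x):
        # Cosine-similarity kernel, each row normalised with a softmax.
        x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
        sim = x @ x.T / temperature
        sim -= sim.max(axis=1, keepdims=True)  # numerical stability
        e = np.exp(sim)
        return e / e.sum(axis=1, keepdims=True)

    p = row_softmax_sim(embeddings)  # similarity structure of embeddings
    q = row_softmax_sim(flows)       # similarity structure of flow (target)
    # Row-wise cross-entropy between the two distributions.
    return float(-(q * np.log(p + 1e-8)).sum(axis=1).mean())
```

The loss is minimised (down to the entropy of the flow kernel) when the embedding kernel reproduces the flow kernel, which avoids asking the network to predict the ambiguous flow values themselves.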
Flow Based Self-supervised Pixel Embedding for Image Segmentation
TLDR
It is demonstrated that image features can be learned in a self-supervised manner by first training an optical flow estimator on synthetic flow data, and then learning image features from the flows estimated on real motion data.
Self-supervised Video Object Segmentation by Motion Grouping
TLDR
A simple variant of the Transformer is introduced to segment optical flow frames into primary objects and the background, which can be trained in a self-supervised manner, i.e. without using any manual annotations, and achieves superior results compared to previous state-of-the-art self-supervised methods on public benchmarks.
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting
TLDR
A novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples, and obtains state-of-the-art results on nearest-neighbour video retrieval and action recognition.
A critical analysis of self-supervision, or what we can learn from a single image
TLDR
It is shown that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used.
Motion-Augmented Self-Training for Video Recognition at Smaller Scale
TLDR
The first motion-augmented self-training regime for 3D convolutional neural networks on unlabeled video collections; it outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7%, and semi-supervised learning by 9%-18% using the same amount of class labels.
Self-labelling via simultaneous clustering and representation learning
TLDR
The proposed novel and principled learning formulation is able to self-label visual data so as to train highly competitive image representations without manual labels and yields the first self-supervised AlexNet that outperforms the supervised Pascal VOC detection baseline.
P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding
Self-supervised representation learning is a critical problem in computer vision, as it provides a way to pretrain feature extractors on large unlabeled datasets that can be used as an initialization…
Unsupervised Learning of Dense Visual Representations
TLDR
View-Agnostic Dense Representation (VADeR) is proposed for unsupervised learning of dense, pixelwise representations, forcing local features to remain constant across different viewing conditions through pixel-level contrastive learning.
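Pixel-level contrastive learning of this kind can be sketched with a toy InfoNCE loss. The function and its details below are hypothetical and not VADeR's actual implementation; they only show the core mechanic of treating the same pixel under two views as a positive pair and all other pixels as negatives:

```python
import numpy as np

def pixel_infonce_loss(view_a, view_b, temperature=0.1):
    """Toy InfoNCE over pixel embeddings: row i of view_a and row i of
    view_b are embeddings of the same pixel under two augmented views.

    view_a, view_b: (N, D) arrays of per-pixel embeddings
    """
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature               # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; negatives are everything else.
    return float(-np.mean(np.diag(log_prob)))
```

When the two views agree pixel-for-pixel, the diagonal dominates and the loss is small; misaligned positives drive it up.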
Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey
TLDR
An extensive review of deep-learning-based self-supervised methods, as a subset of unsupervised learning, for learning general image and video features from large-scale unlabeled data without any human-annotated labels.
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
TLDR
This work extensively studies and validates model performance on over 50 benchmarks, including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets, and discovers that such a model is more robust, more fair, less harmful, and less biased than supervised models or models trained on object-centric datasets such as ImageNet.

References

Showing 1-10 of 68 references
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning
TLDR
Geometry is explored as a new type of auxiliary supervision for the self-supervised learning of video representations, and it is found that convolutional neural networks pre-trained with geometry cues can be effectively adapted to semantic video understanding tasks.
Self-Supervised Learning of Geometrically Stable Features Through Probabilistic Introspection
TLDR
This paper shows empirically that a network pre-trained in this manner requires significantly less supervision to learn semantic object parts compared to numerous pre-training alternatives, and shows that the pre-trained representation is excellent for semantic object matching.
Object-Centric Representation Learning from Unlabeled Videos
TLDR
This work introduces a novel object-centric approach to temporal coherence that encourages similar representations to be learned for object-like regions segmented from nearby frames in a deep convolutional neural network representation.
Context Encoders: Feature Learning by Inpainting
TLDR
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Colorization as a Proxy Task for Visual Understanding
TLDR
This work investigates and improves self-supervision as a drop-in replacement for ImageNet pretraining, focusing on automatic colorization as the proxy task, and presents the first in-depth analysis of self-supervision via colorization, concluding that formulation of the loss, training details and network architecture play important roles in its effectiveness.
Self-Supervised Feature Learning by Learning to Spot Artifacts
  • S. Jenni, P. Favaro
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
TLDR
A novel self-supervised learning method based on adversarial training is proposed: a discriminator network is trained to distinguish real images from images with synthetic artifacts, and features are then extracted from its intermediate layers that can be transferred to other data domains and tasks.
Cross-Domain Self-Supervised Multi-task Feature Learning Using Synthetic Imagery
TLDR
A novel multi-task deep network to learn generalizable high-level visual representations based on adversarial learning is proposed and it is demonstrated that the network learns more transferable representations compared to single-task baselines.
Unsupervised Representation Learning by Predicting Image Rotations
TLDR
This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to their input image, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
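The rotation pretext task is simple enough to sketch directly. The helper below is a hypothetical illustration (not the paper's code) of how a training batch would be built, assuming square images so all four rotations share a shape:

```python
import numpy as np

def make_rotation_batch(image):
    """Build the four rotated copies of a (square) image together with
    the rotation-class labels (0: 0deg, 1: 90deg, 2: 180deg, 3: 270deg)
    that a ConvNet would be trained to predict."""
    rotations = [np.rot90(image, k) for k in range(4)]
    labels = np.arange(4)
    return np.stack(rotations), labels
```

A network trained to classify these labels must attend to canonical object orientation (sky up, heads above shoulders), which is why the task yields semantically useful features.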
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a convolutional neural network, and shows that this method captures information that is temporally varying, such as human pose.
Unsupervised Learning of Visual Representations Using Videos
  • X. Wang, A. Gupta
  • 2015 IEEE International Conference on Computer Vision (ICCV), 2015
TLDR
A simple yet surprisingly powerful approach for unsupervised learning of CNN that uses hundreds of thousands of unlabeled videos from the web to learn visual representations and designs a Siamese-triplet network with a ranking loss function to train this CNN representation.
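The ranking loss mentioned in this summary can be written down in a few lines. The sketch below is a generic squared-distance triplet margin loss, with names and the margin value chosen for illustration rather than taken from the paper:

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet ranking loss: pull the anchor embedding toward a
    positive (e.g. a patch tracked from the same video) and push it away
    from a negative (e.g. a patch from an unrelated video).

    anchor, positive, negative: (B, D) batches of embeddings
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor-positive dist
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor-negative dist
    # Hinge: zero loss once the negative is at least `margin` farther away.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

Tracking in unlabeled video supplies the positive pairs for free, which is what lets this objective scale to hundreds of thousands of web videos without labels.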