OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning

Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, Patrick Pérez
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Learning image representations without human supervision is an important and active research field. Several recent approaches have successfully leveraged the idea of making such a representation invariant under different types of perturbations, especially via contrastive-based instance discrimination training. Although effective visual representations should indeed exhibit such invariances, there are other important characteristics, such as encoding contextual reasoning skills, for which… 
Constrained Mean Shift Using Distant Yet Related Neighbors for Representation Learning
This work proposes to generalize the MSF algorithm by constraining the search space for nearest neighbors, and shows that this method outperforms MSF in the SSL setting when the constraint utilizes a different augmentation of an image, and outperforms PAWS in the semi-supervised setting with fewer training resources when the constraint ensures the NNs have the same pseudolabel as the query.
Self-Supervised Classification Network
We present Self-Classifier, a novel self-supervised end-to-end classification learning approach. Self-Classifier learns labels and representations simultaneously in a single-stage end-to-end manner.


Learning Representations by Predicting Bags of Visual Words
This work shows that the process of image discretization into visual words can provide the basis for very powerful self-supervised approaches in the image domain, thus allowing further connections to be made to related methods from the NLP domain that have been extremely successful so far.
Unsupervised Visual Representation Learning by Context Prediction
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Unsupervised Learning of Dense Visual Representations
View-Agnostic Dense Representation (VADeR) is proposed for unsupervised learning of dense, pixelwise representations by forcing local features to remain constant over different viewing conditions through pixel-level contrastive learning.
A Framework For Contrastive Self-Supervised Learning And Designing A New Approach
A conceptual framework that characterizes CSL approaches in five aspects, and shows the utility of this framework by designing Yet Another DIM (YADIM) which achieves competitive results on CIFAR-10, STL-10 and ImageNet, and is more robust to the choice of encoder and the representation extraction strategy.
Context Encoders: Feature Learning by Inpainting
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
A novel unsupervised learning approach is introduced to build features suitable for object detection and classification; to facilitate the transfer of features to other tasks, it uses the context-free network (CFN), a siamese-ennead convolutional neural network.
Unsupervised Representation Learning by Predicting Image Rotations
This work proposes to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input, and demonstrates both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning.
Selfie: Self-supervised Pretraining for Image Embedding
The pretraining technique called Selfie, which stands for SELFie supervised Image Embedding, generalizes the concept of masked language modeling of BERT to continuous data, such as images, by making use of the Contrastive Predictive Coding loss.
Scaling and Benchmarking Self-Supervised Visual Representation Learning
It is shown that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation and visual navigation using reinforcement learning.