Unsupervised Visual Representation Learning by Context Prediction

@article{Doersch2015UnsupervisedVR,
  title={Unsupervised Visual Representation Learning by Context Prediction},
  author={Carl Doersch and Abhinav Gupta and Alexei A. Efros},
  journal={2015 IEEE International Conference on Computer Vision (ICCV)},
  year={2015},
  pages={1422--1430}
}
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. [...] Furthermore, we show that the learned ConvNet can be used in the R-CNN framework [19] and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.
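The pretext task summarized above — predicting where one patch sits relative to another — can be sketched as a sampling routine. This is a minimal illustration: the helper name `sample_context_pair`, the patch size, and the gap are assumptions, not the paper's exact preprocessing (which also includes tricks such as color channel dropping and jitter to prevent trivial shortcuts).

```python
import numpy as np

def sample_context_pair(image, patch=16, gap=4, rng=None):
    """Sample a centre patch and one of its 8 neighbours (hypothetical helper).

    Returns (centre, neighbour, label) where label in 0..7 encodes the
    neighbour's relative position -- the classification target that drives
    representation learning in the context-prediction pretext task.
    """
    rng = rng or np.random.default_rng()
    step = patch + gap                        # spacing between patch corners
    h, w = image.shape[:2]
    # top-left corner of the centre patch, keeping all 8 neighbours in-bounds
    y = int(rng.integers(step, h - step - patch + 1))
    x = int(rng.integers(step, w - step - patch + 1))
    offsets = [(-step, -step), (-step, 0), (-step, step),
               (0, -step),                  (0, step),
               (step, -step), (step, 0),    (step, step)]
    label = int(rng.integers(len(offsets)))
    dy, dx = offsets[label]
    centre = image[y:y + patch, x:x + patch]
    neighbour = image[y + dy:y + dy + patch, x + dx:x + dx + patch]
    return centre, neighbour, label
```

A ConvNet pair would then be trained to classify `label` from `(centre, neighbour)`, and the learned features reused for downstream tasks such as R-CNN detection.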
Unsupervised Learning of Visual Representations Using Videos
  • X. Wang, Abhinav Gupta
  • Computer Science
  • 2015 IEEE International Conference on Computer Vision (ICCV)
  • 2015
TLDR
A simple yet surprisingly powerful approach for unsupervised learning of a CNN that uses hundreds of thousands of unlabeled videos from the web to learn visual representations, and designs a Siamese-triplet network with a ranking loss function to train this CNN representation.
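The ranking loss mentioned in the summary can be illustrated with a short NumPy sketch. The function name and margin value here are illustrative assumptions; in the actual method the loss operates on embeddings produced by the Siamese-triplet network, with positives taken from tracked patches in nearby frames.

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style ranking loss on a triplet of embedding vectors.

    Pulls the anchor toward a temporally related 'positive' sample and
    pushes it away from an unrelated 'negative', so that tracked patches
    from the same video end up close in the learned embedding space.
    """
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to negative
    return float(max(0.0, margin + d_pos - d_neg))
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so already well-separated triplets contribute no gradient.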
Learning Representations by Predicting Bags of Visual Words
TLDR
This work shows that the process of image discretization into visual words can provide the basis for very powerful self-supervised approaches in the image domain, thus allowing further connections to be made to related methods from the NLP domain that have been extremely successful so far.
Unsupervised Representation Learning by Predicting Image Rotations
TLDR
This work proposes to learn image features by training ConvNets to recognize the 2D rotation that is applied to the image that it gets as input, and demonstrates both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning.
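The rotation pretext task is simple enough to sketch directly: every image yields four self-labeled training examples, one per rotation. `rotation_batch` is a hypothetical helper name for this illustration.

```python
import numpy as np

def rotation_batch(image):
    """Produce the four rotated copies of an image with their labels.

    Labels 0..3 correspond to rotations of 0, 90, 180, and 270 degrees;
    a classifier trained to predict the label must learn semantic cues
    (object orientation) to succeed, which is the supervisory signal.
    """
    return [(np.rot90(image, k), k) for k in range(4)]
```

A ConvNet is then trained as a plain 4-way classifier on these (image, label) pairs, with no manual annotation required.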
OCEAN: Object-centric arranging network for self-supervised visual representations learning
TLDR
This paper learns the correct arrangement of object proposals to represent an image using a convolutional neural network without any manual annotations, and discovers the representation that considers overlap, inclusion, and exclusion relationship of proposals as well as their relative position.
A Classification approach towards Unsupervised Learning of Visual Representations
TLDR
A model is trained for a foreground and background classification task, in the process of which it learns visual representations; the result is close to the best-performing unsupervised feature-learning technique and better than many other proposed algorithms.
Weakly-Supervised Spatial Context Networks
TLDR
This work proposes spatial context networks that learn to predict a representation of one image patch from another image patch, within the same image, conditioned on their real-valued relative spatial offset, and illustrates that contextual supervision (with pre-trained model initialization) can improve on existing pre-trained model performance.
Object-Centric Representation Learning from Unlabeled Videos
TLDR
This work introduces a novel object-centric approach to temporal coherence that encourages similar representations to be learned for object-like regions segmented from nearby frames in a deep convolutional neural network representation.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
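The temporal order verification task behind this approach can be sketched as a sampling routine: draw three frames and ask a classifier whether they appear in correct temporal order. The helper name and the 50/50 shuffle rate are illustrative assumptions; the actual method samples frames more carefully (e.g. from high-motion segments).

```python
import random

def order_verification_sample(num_frames, rng=None):
    """Draw three distinct frame indices and return (indices, label).

    label is 1 if the indices are in correct temporal order and 0 if
    shuffled -- the binary target used for self-supervised training.
    """
    rng = rng or random.Random()
    i, j, k = sorted(rng.sample(range(num_frames), 3))
    if rng.random() < 0.5:
        return (i, j, k), 1        # correct temporal order
    return (j, i, k), 0            # middle frame swapped: order broken
```

A network fed the three corresponding frames is trained on this binary label, and its features transfer to tasks such as pose estimation.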
Learning Visual Representations with Caption Annotations
TLDR
It is argued that captioned images are easily crawlable and can be exploited to supervise the training of visual representations, and proposed hybrid models, with dedicated visual and textual encoders, show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks.
Learning Features by Watching Objects Move
TLDR
Inspired by the human visual system, low-level motion-based grouping cues can be used to learn an effective visual representation that significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

References

SHOWING 1-10 OF 72 REFERENCES
Data-dependent Initializations of Convolutional Neural Networks
TLDR
This work presents a fast and simple data-dependent initialization procedure that sets the weights of a network such that all units in the network train at roughly the same rate, avoiding vanishing or exploding gradients.
Unsupervised Discovery of Mid-Level Discriminative Patches
TLDR
The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that it could be used in place of visual words for many tasks.
Randomized Max-Margin Compositions for Visual Recognition
A main theme in object detection is currently discriminative part-based models. The powerful model that combines all parts is then typically only feasible for few constituents, which are in turn [...]
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
TLDR
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Discovering objects and their location in images
TLDR
This work treats object categories as topics, so that an image containing instances of several categories is modeled as a mixture of topics, and applies a model developed in the statistical text literature: probabilistic latent semantic analysis (pLSA).
Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
TLDR
This paper proposes a fully automatic algorithm which harvests visual concepts from a large number of Internet images using text-based queries, and shows significant improvement over the competing systems in image classification, including those with strong supervision.
Fully Convolutional Networks for Semantic Segmentation
TLDR
It is shown that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.
Building high-level features using large scale unsupervised learning
TLDR
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.
Blocks That Shout: Distinctive Parts for Scene Classification
TLDR
This paper proposes a simple, efficient, and effective method to learn parts incrementally, starting from a single part occurrence with an Exemplar SVM, and can learn parts which are significantly more informative and for a fraction of the cost, compared to previous part-learning methods.
Mid-level Visual Element Discovery as Discriminative Mode Seeking
TLDR
Given a weakly-labeled image collection, this method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels, and proposes the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches.