Corpus ID: 174801241

Selfie: Self-supervised Pretraining for Image Embedding

@article{Trinh2019SelfieSP,
  title={Selfie: Self-supervised Pretraining for Image Embedding},
  author={Trieu H. Trinh and Minh-Thang Luong and Quoc V. Le},
  journal={ArXiv},
  year={2019},
  volume={abs/1906.02940}
}
We introduce a pretraining technique called Selfie, which stands for SELFie supervised Image Embedding. Selfie generalizes the concept of masked language modeling of BERT (Devlin et al., 2019) to continuous data, such as images, by making use of the Contrastive Predictive Coding loss (Oord et al., 2018). Given masked-out patches in an input image, our method learns to select the correct patch, among other "distractor" patches sampled from the same image, to fill in the masked location. This…
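To make the objective concrete, here is a minimal PyTorch sketch of the patch-selection step described in the abstract. The function name, tensor shapes, and batching of candidates are illustrative assumptions; the actual method also involves a patch-processing network and an attention pooling network that are omitted here.

import torch
import torch.nn.functional as F

def selfie_selection_loss(context, candidates, target):
    # context:    (B, D)    representation computed from the visible patches
    # candidates: (B, K, D) embedding of the true patch plus K-1 distractors
    # target:     (B,)      index of the true patch among the K candidates
    logits = torch.einsum('bd,bkd->bk', context, candidates)  # dot-product scores
    return F.cross_entropy(logits, target)  # CPC/InfoNCE-style classification

# Toy usage with random tensors.
B, K, D = 8, 16, 128
loss = selfie_selection_loss(torch.randn(B, D), torch.randn(B, K, D),
                             torch.randint(0, K, (B,)))

One reading of the design choice: because the distractors come from the same image, global image statistics cannot solve the task, which pushes the model toward location-specific content.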

Citations

G-SimCLR: Self-Supervised Contrastive Learning with Guided Projection via Pseudo Labelling
TLDR
This work proposes that, with the normalized temperature-scaled cross-entropy loss (as used in SimCLR), it is beneficial not to have images of the same category in the same batch, and it obtains pseudo labels for this purpose from the latent-space representation of a denoising autoencoder trained on the unlabeled dataset.
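For reference, a generic PyTorch sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss this summary refers to; this is the plain SimCLR formulation, without G-SimCLR's guided batch construction, and all names are my own.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # z1, z2: (N, D) projections of two augmented views, paired row-wise.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # unit-norm, (2N, D)
    sim = z @ z.t() / temperature                       # scaled cosine similarity
    n = z1.size(0)
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    # The positive for row i is its counterpart view at i+n (mod 2n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

G-SimCLR's contribution is in how the batches feeding this loss are composed (no two images with the same pseudo label together), not in the loss itself.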
Self-supervised Pre-training with Hard Examples Improves Visual Representations
TLDR
This paper proposes new data augmentation methods that generate training examples whose pseudo-labels are harder to predict than those produced by random image transformations, and proves that hard examples are instrumental in improving the generalization of the pre-trained models.
Learning Representations by Predicting Bags of Visual Words
TLDR
This work shows that discretizing images into visual words can provide the basis for very powerful self-supervised approaches in the image domain, allowing further connections to related methods from the NLP domain that have been extremely successful so far.
Generative Pretraining From Pixels
TLDR
This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
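A toy PyTorch sketch of that objective with illustrative sizes (a real iGPT-scale model is vastly larger): treat the image as a sequence of discrete pixel tokens and minimize next-token cross-entropy under a causal mask.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, T, D = 512, 64, 128                 # color clusters, sequence length, width
embed = nn.Embedding(vocab, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
body = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(D, vocab)

tokens = torch.randint(0, vocab, (4, T))   # (B, T) flattened pixel tokens
causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
h = body(embed(tokens), mask=causal)       # causal self-attention only
loss = F.cross_entropy(head(h)[:, :-1].reshape(-1, vocab),
                       tokens[:, 1:].reshape(-1))   # predict the next pixel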
OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning
TLDR
A teacher-student scheme learns representations by training a convolutional net to reconstruct a bag-of-visual-words (BoW) representation of an image, given as input a perturbed version of that same image; it improves over all prior unsupervised approaches.
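The training signal can be sketched as follows; the hard-assignment quantizer and all shapes are simplifying assumptions, and in the paper the visual-word vocabulary is itself generated online rather than fixed.

import torch
import torch.nn.functional as F

B, N, D, K = 8, 49, 64, 128            # batch, local features, feature dim, words
vocab = torch.randn(K, D)              # visual-word vocabulary (assumed given)

teacher_feats = torch.randn(B, N, D)   # teacher features of the clean view
assign = torch.cdist(teacher_feats, vocab.expand(B, K, D)).argmin(dim=-1)
target = F.one_hot(assign, K).float().mean(dim=1)   # (B, K) BoW distribution

student_logits = torch.randn(B, K)     # student prediction from the perturbed view
loss = -(target * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()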
BEiT: BERT Pre-Training of Image Transformers
TLDR
A self-supervised vision representation model, BEiT (Bidirectional Encoder representation from Image Transformers), is introduced; experimental results on image classification and semantic segmentation show that it achieves competitive results with previous pre-training methods.
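A compact sketch of this masked-image-modeling recipe, assuming the visual tokenizer is given (BEiT obtains its discrete targets from a pretrained discrete VAE); sizes are toy values.

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, V = 4, 196, 128, 8192
patches = torch.randn(B, T, D)                  # patch embeddings
visual_tokens = torch.randint(0, V, (B, T))     # tokenizer output (assumed)
mask = torch.rand(B, T) < 0.4                   # positions to corrupt

x = torch.where(mask.unsqueeze(-1), torch.zeros(D), patches)  # mask token = 0
layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
h = nn.TransformerEncoder(layer, num_layers=2)(x)  # bidirectional: no causal mask
logits = nn.Linear(D, V)(h)                        # (B, T, V)
loss = F.cross_entropy(logits[mask], visual_tokens[mask])  # masked positions only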
UNITER: Learning UNiversal Image-TExt Representations
TLDR
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
SSAST: Self-Supervised Audio Spectrogram Transformer
  • Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James R. Glass
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR
This paper proposes to pretrain the Audio Spectrogram Transformer (AST) with joint discriminative and generative masked spectrogram patch modeling (MSPM), using unlabeled audio from AudioSet and Librispeech; it is the first patch-based self-supervised learning framework in the audio and speech domain, and the first self-supervised learning framework for AST.
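A sketch of what a joint discriminative and generative masked-patch objective can look like; SSAST's projection heads and loss weighting are omitted, and the shapes are toy values.

import torch
import torch.nn.functional as F

M, P = 16, 256                         # masked patches in the batch, patch dim
h = torch.randn(M, P)                  # encoder outputs at the masked slots
true_patches = torch.randn(M, P)       # the spectrogram patches that were masked

gen_loss = F.mse_loss(h, true_patches)                 # generative: reconstruct
logits = h @ true_patches.t()                          # (M, M) similarities
disc_loss = F.cross_entropy(logits, torch.arange(M))   # discriminative: pick the true patch
loss = gen_loss + disc_loss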
Rethinking supervised pre-training for better downstream transferring
  • Yutong Feng, Jianwen Jiang, Mingqian Tang, Rong Jin, Yue Gao
  • Computer Science
  • ArXiv
  • 2021
TLDR
A new supervised pre-training method based on Leave-One-Out K-Nearest-Neighbor, or LOOK for short, is proposed. It relieves the problem of overfitting to upstream tasks by requiring each image to share its class label only with most of its k nearest neighbors, thus allowing each class to exhibit a multi-mode distribution and consequently preserving part of the intra-class difference for better transfer to downstream tasks.
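My reading of that objective, as a hedged sketch: classify each embedding by a soft vote over its k nearest in-batch neighbors, excluding itself (hence leave-one-out), with cross-entropy against its own label. The soft-vote form and all names are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def look_loss(emb, labels, k=5, tau=0.1, num_classes=10):
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t()
    sim.fill_diagonal_(float('-inf'))      # leave-one-out: no self votes
    topv, topi = sim.topk(k, dim=1)
    w = F.softmax(topv / tau, dim=1)       # (B, k) neighbor weights
    neigh = F.one_hot(labels[topi], num_classes).float()       # (B, k, C)
    probs = (w.unsqueeze(-1) * neigh).sum(dim=1).clamp_min(1e-8)
    return F.nll_loss(probs.log(), labels)

loss = look_loss(torch.randn(32, 64), torch.randint(0, 10, (32,)))

Only the k nearest neighbors need to agree, so a class can occupy several separate modes in embedding space.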

References

Showing 1-10 of 41 references.
Revisiting Self-Supervised Visual Representation Learning
TLDR
This study revisits numerous previously proposed self-supervised models, conducts a thorough large-scale study, and uncovers multiple crucial insights about standard recipes for CNN design that do not always translate to self-supervised representation learning.
Context Encoders: Feature Learning by Inpainting
TLDR
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
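A minimal sketch of the reconstruction half of that setup (the paper additionally uses an adversarial loss, omitted here): encode the image with its central region removed and regress the missing pixels.

import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(                         # toy encoder-decoder
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

img = torch.rand(8, 3, 64, 64)
masked = img.clone()
masked[:, :, 16:48, 16:48] = 0.0             # drop the central region
recon = net(masked)
loss = F.mse_loss(recon[:, :, 16:48, 16:48], # penalize only the missing region
                  img[:, :, 16:48, 16:48])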
AutoAugment: Learning Augmentation Policies from Data
TLDR
This paper describes a simple procedure called AutoAugment to automatically search for improved data augmentation policies, which achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data).
Unsupervised Visual Representation Learning by Context Prediction
TLDR
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
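A toy sketch of the pretext task: embed a center patch and one of its eight neighbors with a shared encoder and classify the neighbor's relative position. The encoder and patch sizes are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())  # shared for both patches
head = nn.Linear(32, 8)                     # 8 possible relative positions

center = torch.rand(8, 3, 15, 15)
neighbor = torch.rand(8, 3, 15, 15)
pos = torch.randint(0, 8, (8,))             # which neighbor position it came from
logits = head(torch.cat([enc(center), enc(neighbor)], dim=1))
loss = F.cross_entropy(logits, pos)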
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
TLDR
A novel unsupervised learning approach builds features suitable for object detection and classification; to facilitate the transfer of features to other tasks, the context-free network (CFN), a siamese-ennead convolutional neural network, is introduced.
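A reduced sketch of the permutation-classification idea, using 4 tiles and one permutation per batch for brevity (the CFN uses 9 tiles, a curated permutation set, and a per-image permutation).

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

perms = [(0, 1, 2, 3), (1, 0, 3, 2), (3, 2, 1, 0), (2, 3, 0, 1)]  # toy set
enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 32), nn.ReLU())
head = nn.Linear(4 * 32, len(perms))

tiles = [torch.rand(8, 3, 16, 16) for _ in range(4)]     # 4 tiles per image
p = random.randrange(len(perms))
z = torch.cat([enc(tiles[i]) for i in perms[p]], dim=1)  # order-aware concat
loss = F.cross_entropy(head(z), torch.full((8,), p, dtype=torch.long))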
Unsupervised Representation Learning by Predicting Image Rotations
TLDR
This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to the input image, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
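This pretext task is simple enough to sketch almost in full: build the four rotated copies of each image and train a classifier to recover the rotation. The backbone below is a placeholder.

import torch
import torch.nn as nn
import torch.nn.functional as F

def rotations(batch):
    # torch.rot90 rotates in the (H, W) plane; k = number of 90-degree turns.
    rots = [torch.rot90(batch, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(batch.size(0))
    return torch.cat(rots, dim=0), labels

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(16, 4))      # 4-way rotation classifier
x, y = rotations(torch.rand(8, 3, 32, 32))
loss = F.cross_entropy(backbone(x), y)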
Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
TLDR
This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.
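The denoising criterion itself is tiny; here Gaussian corruption stands in for the paper's corruption processes (which also include masking noise).

import torch
import torch.nn as nn
import torch.nn.functional as F

ae = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 784))
x = torch.rand(32, 784)
noisy = x + 0.3 * torch.randn_like(x)       # corrupt the input
loss = F.mse_loss(ae(noisy), x)             # reconstruct the *clean* input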
Temporal Ensembling for Semi-Supervised Learning
TLDR
Self-ensembling is introduced: an ensemble prediction accumulated over previous training epochs can be expected to be a better predictor for the unknown labels than the output of the network at the most recent epoch, and can thus be used as a target for training.
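A sketch of the target update as I understand it from the paper: keep an exponential moving average of each sample's past predictions, correct its startup bias, and penalize disagreement with it. Buffer sizes and the momentum are toy values.

import torch
import torch.nn.functional as F

alpha = 0.6
ema_preds = torch.zeros(1000, 10)           # one running target per sample

def consistency_step(logits, idx, epoch):
    probs = F.softmax(logits, dim=1)
    with torch.no_grad():
        ema_preds[idx] = alpha * ema_preds[idx] + (1 - alpha) * probs
        # Bias correction, since the accumulator starts at zero.
        targets = ema_preds[idx] / (1 - alpha ** (epoch + 1))
    return F.mse_loss(probs, targets)       # add to the supervised loss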
S4L: Self-Supervised Semi-Supervised Learning
TLDR
It is shown that S4L and existing semi-supervised methods can be jointly trained, yielding a new state-of-the-art result on semi-supervised ILSVRC-2012 with 10% of labels.
Unsupervised Data Augmentation for Consistency Training
TLDR
A new perspective on how to effectively noise unlabeled examples is presented, and it is argued that the quality of noising, specifically that produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
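A minimal sketch of the consistency term: match the model's prediction on an augmented view to its detached prediction on the clean view. `strong_augment` is a placeholder for an advanced augmentation policy such as RandAugment.

import torch
import torch.nn.functional as F

def uda_consistency(model, x_unlabeled, strong_augment):
    with torch.no_grad():                   # clean branch provides fixed targets
        p_clean = F.softmax(model(x_unlabeled), dim=1)
    logp_aug = F.log_softmax(model(strong_augment(x_unlabeled)), dim=1)
    return F.kl_div(logp_aug, p_clean, reduction='batchmean')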