Corpus ID: 219781060

# Generative Pretraining From Pixels

@inproceedings{Chen2020GenerativePF,
title={Generative Pretraining From Pixels},
author={Mark Chen and Alec Radford and Jeff Wu and Heewoo Jun and Prafulla Dhariwal and David Luan and Ilya Sutskever},
booktitle={ICML},
year={2020}
}
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On…
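As a rough illustration of the setup the abstract describes, the following Python sketch shows how an image can be treated as a flat sequence for next-pixel prediction. It is not the authors' code: the raster (row-major) ordering, the toy 4x4 image, and the helper names `raster_sequence`, `next_pixel_pairs`, and `causal_mask` are assumptions made for illustration only.

```python
import numpy as np


def raster_sequence(image):
    """Flatten an (H, W) array of discrete pixel codes into a 1D sequence.

    Row-major (raster) order is an assumption here; after flattening,
    the sequence carries no explicit 2D structure.
    """
    return np.asarray(image).reshape(-1)


def next_pixel_pairs(seq):
    """Build (inputs, targets) for autoregressive training.

    The target at position i is the pixel that follows input position i.
    """
    return seq[:-1], seq[1:]


def causal_mask(n):
    """Boolean mask where position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


# Toy 4x4 "image" with 8 hypothetical discrete pixel codes.
image = np.arange(16).reshape(4, 4) % 8
seq = raster_sequence(image)
inputs, targets = next_pixel_pairs(seq)
mask = causal_mask(len(inputs))
```

A sequence model trained on such pairs with a causal mask predicts each pixel from the pixels before it in the sequence, which is the sense in which the 2D layout is not built into the model.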
241 Citations

Efficient Self-supervised Vision Transformers for Representation Learning
• Chunyuan Li, +5 authors Jianfeng Gao
• Computer Science
• ArXiv
• 2021
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning and proposes a new pre-training task, region matching, which allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Self-supervised Pre-training with Hard Examples Improves Visual Representations
• Computer Science
• ArXiv
• 2020
This paper proposes new data augmentation methods that generate training examples whose pseudo-labels are harder to predict than those generated via random image transformations, and proves that hard examples are instrumental in improving the generalization of the pre-trained models.
Vector-quantized Image Modeling with Improved VQGAN
• Jiahui Yu, Xin Li, +7 authors Yonghui Wu
• Computer Science
• 2021
A Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively, and proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
Training data-efficient image transformers & distillation through attention
• Computer Science
• ICML
• 2021
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
Hybrid Generative-Contrastive Representation Learning
• Computer Science
• ArXiv
• 2021
It is demonstrated that a transformer-based encoder-decoder architecture trained with both contrastive and generative losses can learn highly discriminative and robust representations without hurting the generative performance.
On the Bias Against Inductive Biases
• Computer Science
• ArXiv
• 2021
This work analyzes the effect of the inductive biases present in classical convolutional networks on small to moderately-sized isotropic networks used for unsupervised visual feature learning and shows that their removal is not always ideal.
MST: Masked Self-Supervised Transformer for Visual Representation
• Zhaowen Li, +8 authors Jinqiao Wang
• Computer Science
• ArXiv
• 2021
This paper proposes a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning.
Towards Learning Convolutions from Scratch
This work proposes $\beta$-LASSO, a simple variant of the LASSO algorithm that, when applied to fully-connected networks for image classification tasks, learns architectures with local connections and achieves state-of-the-art accuracies for training fully-connected nets.
Generative Models as a Data Source for Multiview Representation Learning
• Computer Science
• ArXiv
• 2021
This paper compares several representation learning methods that can be applied to the setting of learning general-purpose visual representations from a black-box generative model rather than directly from data, and finds that the resulting representations rival those learned directly from real data.

#### References

SHOWING 1-10 OF 79 REFERENCES
Unsupervised Representation Learning by Predicting Image Rotations
• Computer Science
• ICLR
• 2018
This work proposes to learn image features by training ConvNets to recognize the 2D rotation that is applied to the input image, and demonstrates both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning.
Data-Efficient Image Recognition with Contrastive Predictive Coding
This work revisits and improves Contrastive Predictive Coding, an unsupervised objective for learning such representations which make the variability in natural signals more predictable, and produces features which support state-of-the-art linear classification accuracy on the ImageNet dataset.
Unsupervised Visual Representation Learning by Context Prediction
• Computer Science
• 2015 IEEE International Conference on Computer Vision (ICCV)
• 2015
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Context Encoders: Feature Learning by Inpainting
• Computer Science
• 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2016
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Momentum Contrast for Unsupervised Visual Representation Learning
• Computer Science
• 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2020
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a…
Revisiting Self-Supervised Visual Representation Learning
• Computer Science
• 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2019
This study revisits numerous previously proposed self-supervised models, conducts a thorough large-scale study, and uncovers multiple crucial insights about standard recipes for CNN design that do not always translate to self-supervised representation learning.
Selfie: Self-supervised Pretraining for Image Embedding
• Computer Science, Engineering
• ArXiv
• 2019
The pretraining technique called Selfie, which stands for SELFie supervised Image Embedding, generalizes the concept of masked language modeling of BERT to continuous data, such as images, by making use of the Contrastive Predictive Coding loss.
Representation Learning with Contrastive Predictive Coding
• Computer Science, Mathematics
• ArXiv
• 2018
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
Extracting and composing robust features with denoising autoencoders
• Mathematics, Computer Science
• ICML '08
• 2008
This work introduces and motivates a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern.
Pixel Recurrent Neural Networks
• Computer Science
• ICML
• 2016
A deep neural network is presented that sequentially predicts the pixels in an image along the two spatial dimensions and encodes the complete set of dependencies in the image to achieve log-likelihood scores on natural images that are considerably better than the previous state of the art.