• Corpus ID: 238582653

Vector-quantized Image Modeling with Improved VQGAN

  title={Vector-quantized Image Modeling with Improved VQGAN},
  author={Jiahui Yu and Xin Li and Jing Yu Koh and Han Zhang and Ruoming Pang and James Qin and Alexander Ku and Yuanzhong Xu and Jason Baldridge and Yonghui Wu},
Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformerbased VQGAN (ViT… 
High-Resolution Image Synthesis with Latent Diffusion Models
These latent diffusion models achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation
This work explores a desirable property of image quantizers, called ‘Translation Equivariance in the Quantized Space’, and proposes a simple but effective way to achieve translation equivariance by regularizing orthogonality in the codebook embedding vectors.
Multimodal Image Synthesis and Editing: A Survey
This survey comprehensively contextualizes the advance of the recent multimodal image synthesis & editing and formulate taxonomies according to data modality and model architectures and provides insights into the current research challenges and possible future research directions.


Image Transformer
This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, and significantly increases the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks.
Generative Pretraining From Pixels
This work trains a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure, and finds that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification.
Generating Diverse High-Fidelity Images with VQ-VAE-2
It is demonstrated that a multi-scale hierarchical organization of VQ-VAE, augmented with powerful priors over the latent codes, is able to generate samples with quality that rivals that of state of the art Generative Adversarial Networks on multifaceted datasets such as ImageNet, while not suffering from GAN's known shortcomings such as mode collapse and lack of diversity.
Unsupervised Representation Learning by Predicting Image Rotations
This work proposes to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input, and demonstrates both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning.
BEiT: BERT Pre-Training of Image Transformers
A self-supervised vision representation model BEIT, which stands for Bidirectional Encoder representation from Image Transformers, is introduced and Experimental results on image classification and semantic segmentation show that the model achieves competitive results with previous pre-training methods.
Large Scale Adversarial Representation Learning
This work builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator, and demonstrates that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as in unconditional image generation.
Conditional Image Generation with PixelCNN Decoders
The gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.
Neural Discrete Representation Learning
Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
Emerging Properties in Self-Supervised Vision Transformers
This paper questions if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets), and implements DINO, a form of self-distillation with no labels, which is implemented into a simple self- supervised method.
Unsupervised Visual Representation Learning by Context Prediction
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.