Corpus ID: 3353110

Image Transformer

@article{Parmar2018ImageT,
  title={Image Transformer},
  author={Niki Parmar and Ashish Vaswani and Jakob Uszkoreit and Lukasz Kaiser and Noam M. Shazeer and Alexander Ku and Dustin Tran},
  journal={ArXiv},
  year={2018},
  volume={abs/1802.05751}
}
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. [...] Key result: In a human evaluation study, we show that our super-resolution models improve significantly over previously published autoregressive super-resolution models. Images they generate fool human observers three times more often than the previous state of the art.

MaskGIT: Masked Generative Image Transformer
TLDR
This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder and demonstrates that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset and accelerates autoregressive decoding by up to 64x.
Attention Augmented Convolutional Networks
TLDR
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile constrained network, while keeping the number of parameters similar.
Improved Transformer for High-Resolution GANs
TLDR
The proposed HiT is an important milestone for GAN generators that are completely free of convolutions; it has nearly linear computational complexity with respect to the image size and thus scales directly to synthesizing high-definition images.
Vision Transformer with Progressive Sampling
TLDR
An iterative and progressive sampling strategy is proposed to locate discriminative regions; when combined with the Vision Transformer, the resulting PS-ViT network can adaptively learn where to look.
Self-Attention Generative Adversarial Networks
TLDR
The proposed SAGAN achieves state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Fréchet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset.
Locally Masked Convolution for Autoregressive Models
TLDR
LMConv is introduced: a simple modification to the standard 2D convolution that allows arbitrary masks to be applied to the weights at each location in the image, achieving improved performance on whole-image density estimation and globally coherent image completions.
ViViT: A Video Vision Transformer
TLDR
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
XCiT: Cross-Covariance Image Transformers
TLDR
This work proposes a “transposed” version of self-attention that operates across feature channels rather than tokens, with interactions based on the cross-covariance matrix between keys and queries. The resulting attention has linear complexity in the number of tokens and allows efficient processing of high-resolution images.
Vector-quantized Image Modeling with Improved VQGAN
TLDR
This work presents a Vector-quantized Image Modeling (VIM) approach that pretrains a Transformer to predict rasterized image tokens autoregressively, and proposes multiple improvements over vanilla VQGAN, from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.

References

Conditional Image Generation with PixelCNN Decoders
TLDR
The gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.
Generative Image Modeling Using Spatial LSTMs
TLDR
This work introduces a recurrent image model based on multidimensional long short-term memory units, which are particularly suited for image modeling due to their spatial structure. The model outperforms the state of the art in quantitative comparisons on several image datasets and produces promising results when used for texture synthesis and inpainting.
PixelSNAIL: An Improved Autoregressive Generative Model
TLDR
This work introduces a new generative model architecture that combines causal convolutions with self attention and presents state-of-the-art log-likelihood results on CIFAR-10 and ImageNet.
The student-t mixture as a natural image patch prior with application to image compression
TLDR
This work demonstrates that the Student-t mixture model convincingly surpasses GMMs in terms of log likelihood, achieving performance competitive with the state of the art in image patch modeling, and proposes efficient coding schemes that can easily be extended to other unsupervised machine learning models.
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
TLDR
SRGAN, a generative adversarial network (GAN) for image super-resolution (SR), is presented. To the authors' knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors, and it uses a perceptual loss function that consists of an adversarial loss and a content loss.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The Transformer generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks
TLDR
This paper proposes Stacked Generative Adversarial Networks (StackGAN) to generate 256x256 photo-realistic images conditioned on text descriptions and introduces a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold.
Generating Images from Captions with Attention
TLDR
It is demonstrated that the proposed model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
TLDR
This work introduces a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrates that they are a strong candidate for unsupervised learning.
BEGAN: Boundary Equilibrium Generative Adversarial Networks
TLDR
This work proposes a new equilibrium-enforcing method paired with a loss derived from the Wasserstein distance for training auto-encoder-based Generative Adversarial Networks, which provides a new approximate convergence measure, fast and stable training, and high visual quality.