Corpus ID: 209323787

Axial Attention in Multidimensional Transformers

@article{Ho2019AxialAI,
  title={Axial Attention in Multidimensional Transformers},
  author={Jonathan Ho and Nal Kalchbrenner and Dirk Weissenborn and Tim Salimans},
  journal={ArXiv},
  year={2019},
  volume={abs/1912.12180}
}
We propose Axial Transformers, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors. Existing autoregressive models either suffer from excessively large computational resource requirements for high dimensional data, or make compromises in terms of distribution expressiveness or ease of implementation in order to decrease resource requirements. Our architecture, by contrast, maintains both full expressiveness over joint distributions over… 
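To make the idea concrete, here is a minimal NumPy sketch of axial attention as the abstract describes it: instead of attending jointly over all H x W positions of an image (quadratic in H*W), self-attention is applied along one axis at a time, so a row layer costs about H * W^2 and a column layer about W * H^2. The function names, shapes, and the omission of projections, multi-head structure, and autoregressive masking are simplifications of mine, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product self-attention over the second-to-last (sequence) axis.
    # q, k, v: (..., seq_len, d)
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)  # (..., seq_len, seq_len)
    return softmax(scores, axis=-1) @ v               # (..., seq_len, d)

def axial_attention(x, axis):
    # x: (H, W, d) feature map. Attend only along the chosen spatial axis:
    # axis=1 lets each row attend over its W columns (cost ~ H * W^2),
    # axis=0 lets each column attend over its H rows (cost ~ W * H^2),
    # versus (H*W)^2 for full joint attention over the flattened image.
    if axis == 0:
        x = np.swapaxes(x, 0, 1)          # (W, H, d): batch over columns
    out = attention(x, x, x)
    if axis == 0:
        out = np.swapaxes(out, 0, 1)      # back to (H, W, d)
    return out

# Illustrative use: a row-attention layer followed by a column-attention layer.
x = np.random.randn(32, 32, 64)           # toy 32x32 feature map with 64 channels
y = axial_attention(axial_attention(x, axis=1), axis=0)
print(y.shape)                            # (32, 32, 64)
```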
Citations

Improved Transformer for High-Resolution GANs
TLDR: The proposed HiT is an important milestone toward GAN generators that are completely free of convolutions; it has nearly linear computational complexity with respect to the image size and thus scales directly to synthesizing high-definition images.
XCiT: Cross-Covariance Image Transformers
TLDR: This work proposes a "transposed" version of self-attention that operates across feature channels rather than tokens, with interactions based on the cross-covariance matrix between keys and queries; it has linear complexity in the number of tokens and allows efficient processing of high-resolution images.
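A hedged NumPy sketch of the channel-wise ("transposed") attention described in the XCiT entry above: a d x d cross-covariance of keys and queries replaces the n x n token-token matrix, so the cost grows linearly with the number of tokens. The normalization, the fixed temperature `tau`, and the absence of multiple heads are my own simplifications, not XCiT's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_covariance_attention(q, k, v, tau=1.0):
    # q, k, v: (n_tokens, d). The attention map is (d, d): it mixes feature
    # channels, so cost grows linearly with n_tokens instead of quadratically.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)  # normalize each channel
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-6)
    attn = softmax((k.T @ q) / tau, axis=-1)   # (d, d) cross-covariance weights
    return v @ attn                            # (n_tokens, d)

x = np.random.randn(4096, 64)                  # e.g. 64x64 = 4096 tokens (illustrative)
print(cross_covariance_attention(x, x, x).shape)   # (4096, 64)
```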
Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems
TLDR: This work investigates approximating the two central components of the Transformer, multi-head self-attention and point-wise feed-forward transformation, with reduced parameter space and computational complexity, and formulates a temporal evolution scheme, TransEvolve, that bypasses costly dot-product attention over multiple stacked layers.
Combiner: Full Attention Transformer with Sparse Computation Cost
TLDR: Combiner is a drop-in replacement for attention layers in existing transformers that can be easily implemented in common frameworks; the work shows that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization of full attention.
KVT: k-NN Attention for Boosting Vision Transformers
TLDR: Proposes a sparse attention scheme, dubbed k-NN attention, which naturally inherits the local bias of CNNs without introducing convolutional operations; it allows exploring long-range correlations and filters out irrelevant tokens by choosing the most similar tokens from the entire image.
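A small NumPy sketch of the k-NN attention idea in the KVT entry above: each query keeps only its k most similar keys and masks out the rest before the softmax. For clarity this naive version still materializes the full n x n score matrix; it illustrates the selection scheme, not an efficient kernel, and `topk` and the shapes are illustrative choices of mine.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knn_attention(q, k, v, topk=8):
    # q, k, v: (n_tokens, d). Each query attends only to its top-k most similar
    # keys; every other score is masked to -inf before the softmax, which
    # filters out irrelevant tokens without any convolutional operation.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (n, n) similarities
    kth = np.sort(scores, axis=-1)[:, -topk][:, None]  # each row's k-th largest score
    masked = np.where(scores >= kth, scores, -np.inf)  # keep only the top-k entries
    return softmax(masked, axis=-1) @ v                # (n, d)

x = np.random.randn(196, 64)        # e.g. 14x14 = 196 patch tokens (illustrative)
print(knn_attention(x, x, x).shape) # (196, 64)
```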
Long-Short Transformer: Efficient Transformers for Language and Vision
TLDR: The proposed Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks, aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers
TLDR: The Adaptive Fourier Neural Operator is proposed as an efficient token mixer that learns to mix in the Fourier domain; it can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms for few-shot segmentation in terms of both efficiency and accuracy.
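A minimal NumPy sketch of Fourier-domain token mixing in the spirit of the AFNO entry above: an FFT over the token axis, a learned per-frequency operation, and an inverse FFT give O(n log n) mixing. AFNO's actual per-frequency operator is a learned shared MLP with soft-thresholding; the simple complex weight `w` here is my stand-in for it, not the paper's exact operator.

```python
import numpy as np

def fourier_token_mixer(x, w):
    # x: (n_tokens, d) real-valued token features.
    # w: (n_tokens // 2 + 1, d) complex per-frequency, per-channel weights,
    #    a stand-in for AFNO's learned operator in frequency space.
    freq = np.fft.rfft(x, axis=0)                   # mix tokens via the frequency domain
    freq = freq * w                                 # pointwise mixing per frequency
    return np.fft.irfft(freq, n=x.shape[0], axis=0) # back to token space, O(n log n) overall

n, d = 1024, 64                                     # illustrative sizes
x = np.random.randn(n, d)
w = np.random.randn(n // 2 + 1, d) + 1j * np.random.randn(n // 2 + 1, d)
print(fourier_token_mixer(x, w).shape)              # (1024, 64)
```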
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
TLDR: This work describes an efficient hierarchical method to compute attention in the Transformer architecture that exploits a matrix structure similar to the Hierarchical Matrix developed by the numerical analysis community, and has linear run time and memory complexity.
Vision Transformer with Progressive Sampling
TLDR: An iterative and progressive sampling strategy is proposed to locate discriminative regions; when combined with the Vision Transformer, the resulting PS-ViT network can adaptively learn where to look.
Scalable Visual Transformers with Hierarchical Pooling
TLDR: A Hierarchical Visual Transformer (HVT) is proposed which progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature-map downsampling in Convolutional Neural Networks (CNNs).
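A tiny NumPy sketch of the token pooling the HVT entry above refers to: grouping consecutive tokens and pooling them shrinks the sequence length, and therefore the cost of later attention layers, much as downsampling shrinks feature maps in a CNN. The choice of average pooling, the stride, and where the pooling is applied are illustrative choices of mine, not HVT's exact configuration.

```python
import numpy as np

def pool_tokens(x, stride=2):
    # x: (n_tokens, d). Average-pool groups of `stride` consecutive tokens,
    # shrinking the sequence length (and the cost of subsequent attention),
    # analogous to downsampling feature maps in a CNN.
    n, d = x.shape
    n_trim = (n // stride) * stride                 # drop any ragged tail
    return x[:n_trim].reshape(n // stride, stride, d).mean(axis=1)

x = np.random.randn(196, 64)                        # illustrative patch tokens
print(pool_tokens(x).shape)                         # (98, 64)
```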

References

Showing 1-10 of 17 references
Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling
TLDR: The Subscale Pixel Network (SPN) is proposed, a conditional decoder architecture that generates an image as a sequence of equally sized sub-images; it compactly captures image-wide spatial dependencies and requires a fraction of the memory and computation of other fully autoregressive models.
Image Transformer
TLDR: This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, and significantly increases the size of images the model can process in practice while maintaining significantly larger receptive fields per layer than typical convolutional neural networks.
PixelSNAIL: An Improved Autoregressive Generative Model
TLDR: This work introduces a new generative model architecture that combines causal convolutions with self-attention and presents state-of-the-art log-likelihood results on CIFAR-10 and ImageNet.
Generating Long Sequences with Sparse Transformers
TLDR: This paper introduces sparse factorizations of the attention matrix which reduce the quadratic time and memory cost of self-attention to $O(n \sqrt{n})$; it generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
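A NumPy sketch of the kind of factorized sparsity pattern the Sparse Transformers entry above refers to, combining a local band with a strided pattern in a single causal mask. A real implementation evaluates only the roughly O(n sqrt(n)) allowed entries instead of building a dense matrix; the dense boolean mask here is purely to visualize the pattern, and the specific way the two patterns are combined is an illustrative choice of mine.

```python
import numpy as np

def strided_sparse_mask(n, stride):
    # Boolean (n, n) causal mask combining a local band of width `stride`
    # with a strided pattern that hits every `stride`-th earlier position
    # (with stride ~ sqrt(n), only ~O(n*sqrt(n)) entries are attended).
    i = np.arange(n)[:, None]                  # query positions
    j = np.arange(n)[None, :]                  # key positions
    causal = j <= i
    local = (i - j) < stride                   # the most recent `stride` positions
    strided = ((i - j) % stride) == 0          # every `stride`-th earlier position
    return causal & (local | strided)

mask = strided_sparse_mask(64, stride=8)
print(int(mask.sum()), "of", 64 * 64, "entries attended")
```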
Generative Image Modeling Using Spatial LSTMs
TLDR: This work introduces a recurrent image model based on multidimensional long short-term memory units, which are particularly suited to image modeling due to their spatial structure; the model outperforms the state of the art in quantitative comparisons on several image datasets and produces promising results when used for texture synthesis and inpainting.
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Scaling Autoregressive Video Models
TLDR: It is shown that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism.
Conditional Image Generation with PixelCNN Decoders
TLDR: The gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.
Pixel Recurrent Neural Networks
TLDR: A deep neural network is presented that sequentially predicts the pixels in an image along the two spatial dimensions and encodes the complete set of dependencies in the image to achieve log-likelihood scores on natural images that are considerably better than the previous state of the art.
PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications
TLDR: This work discusses the implementation of PixelCNNs, a recently proposed class of powerful generative models with tractable likelihood, and describes a number of modifications to the original model that both simplify its structure and improve its performance.