Corpus ID: 239885427

Hierarchical Transformers Are More Efficient Language Models

@article{Nawrot2021HierarchicalTA,
  title={Hierarchical Transformers Are More Efficient Language Models},
  author={Piotr Nawrot and Szymon Tworkowski and Michal Tyrolski and Lukasz Kaiser and Yuhuai Wu and Christian Szegedy and Henryk Michalewski},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.13711}
}
Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences, which allows them to produce long, coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that…
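A rough illustration of the hierarchy the abstract refers to: shorten the sequence of token activations, run most Transformer layers at the reduced length, then upsample back to full resolution with a residual from the full-length stream. This is a minimal sketch under assumed choices, not the paper's Hourglass implementation: the mean-pooling downsampler, nearest-neighbour upsampler, layer counts, and the HourglassSketch name are all assumptions, and the causal masking needed for autoregressive language modeling is omitted.

# Minimal sketch of a hierarchical (hourglass-shaped) Transformer stack.
# Assumptions: mean pooling to shorten, repetition to upsample, no causal masks.
import torch
import torch.nn as nn

class HourglassSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, shorten_factor=4,
                 pre_layers=1, shortened_layers=2, post_layers=1):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.k = shorten_factor
        self.pre = nn.TransformerEncoder(layer(), pre_layers)          # full length
        self.middle = nn.TransformerEncoder(layer(), shortened_layers) # length / k
        self.post = nn.TransformerEncoder(layer(), post_layers)        # full length

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        x = self.pre(x)
        # Downsample: mean-pool groups of k consecutive token activations.
        short = x.reshape(x.size(0), -1, self.k, x.size(-1)).mean(dim=2)
        short = self.middle(short)               # bulk of compute on seq_len / k tokens
        # Upsample: repeat each pooled vector k times and add a residual from
        # the full-resolution stream so token-level detail is preserved.
        up = short.repeat_interleave(self.k, dim=1)
        return self.post(x + up)

tokens = torch.randn(2, 64, 256)                 # seq_len (64) must be divisible by k (4)
print(HourglassSketch()(tokens).shape)           # torch.Size([2, 64, 256])

The efficiency gain in this sketch comes from running the middle layers on a sequence k times shorter, so their attention cost drops roughly by k squared; the pre and post layers keep full resolution so the output stays token-aligned.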
