FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware, accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO…
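The IO-aware idea above can be illustrated with a minimal NumPy sketch of tiled attention with an online softmax: keys and values are processed one block at a time so the full n×n score matrix is never materialized. The block size and variable names here are illustrative assumptions, not the paper's CUDA implementation.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact softmax attention computed one key/value block at a time,
    so the full n x n score matrix is never materialized."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max per query row
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)   # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)   # rescale old accumulators
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]
```

Because each block's contribution is rescaled by the running row maximum, the result matches standard attention exactly while the per-step working set stays proportional to the block size.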
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
The proposed Language-Interfaced Fine-Tuning (LIFT) makes no changes to the model architecture or loss function; it relies solely on the natural language interface, enabling “no-code machine learning with LMs,” and performs relatively well across a wide range of low-dimensional classification and regression tasks.
Phyloformer: towards fast and accurate phylogeny estimation with self-attention networks
This work presents a radically different approach with a transformer-based network architecture that, given a multiple sequence alignment, predicts all the pairwise evolutionary distances between the sequences, which in turn allow us to accurately reconstruct the tree topology with standard distance-based algorithms.


Self-attention Does Not Need O(n²) Memory
This work provides a practical implementation for accelerators that requires O(√n) memory, is numerically stable, and is within a few percent of the runtime of the standard implementation of attention; it also demonstrates how to differentiate the function while remaining memory-efficient.
Data Movement Is All You Need: A Case Study on Optimizing Transformers
This work finds that data movement is the key bottleneck when training, and presents a recipe for globally optimizing data movement in transformers to achieve a 1.30x performance improvement over state-of-the-art frameworks when training BERT.
Training Deep Nets with Sublinear Memory Cost
This work designs an algorithm that costs O(√n) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch, showing that it is possible to trade computation for memory and obtain a more memory-efficient training algorithm at a small extra computation cost.
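The compute-for-memory trade above can be sketched in pure Python: only every k-th activation is stored as a checkpoint during the forward pass, and each segment's intermediate activations are recomputed from its checkpoint during the backward pass. The function names and layer representation are assumptions for illustration, not the paper's code.

```python
def forward_checkpointed(x, layers, k):
    """Run layers in order, keeping only every k-th input as a checkpoint."""
    checkpoints = []
    for i, f in enumerate(layers):
        if i % k == 0:
            checkpoints.append((i, x))  # store O(n/k) activations, not O(n)
        x = f(x)
    return x, checkpoints

def recompute_segment(checkpoints, layers, k):
    """Backward-pass helper: rebuild each segment's activations from its
    checkpoint, one segment at a time (one extra forward per segment)."""
    for i, x in reversed(checkpoints):
        acts = [x]
        for f in layers[i:i + k]:
            acts.append(f(acts[-1]))
        yield i, acts
```

Choosing k ≈ √n balances the two stores (n/k checkpoints plus k live activations per segment), which is where the O(√n) memory bound comes from.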
Efficient Content-Based Sparse Attention with Routing Transformers
This work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest, and shows that this model outperforms comparable sparse attention models on language modeling on Wikitext-103, as well as on image generation on ImageNet-64 while using fewer self-attention layers.
Transformer Quality in Linear Time
This work revisits the design choices in Transformers and proposes a simple layer named the gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss, together with a complementary linear approximation method that is accelerator-friendly and highly competitive in quality.
In-datacenter performance analysis of a tensor processing unit
  • N. Jouppi, C. Young, D. Yoon
  • Computer Science
    2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
  • 2017
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 to accelerate the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Do Transformers Need Deep Long-Range Memory?
This work performs a set of interventions to show that comparable performance can be obtained with 6x fewer long-range memories, and that better performance can be obtained by limiting the range of attention in the lower layers of the network.
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce this cost to O(n√n), generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
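A fixed sparsity pattern of the kind used in this family of models can be sketched as a boolean mask: each query attends to a local causal window plus every stride-th "summary" column, so each row has far fewer than n allowed positions. The window and stride values here are illustrative assumptions, not the paper's exact factorization.

```python
import numpy as np

def strided_sparse_mask(n, window=4, stride=4):
    """Boolean (n, n) causal mask: True where attention is allowed."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    causal = j <= i                        # no attending to the future
    local = (i - j) < window               # recent local window
    strided = (j % stride) == stride - 1   # periodic summary columns
    return causal & (local | strided)
```

Composing a local and a strided head in this way lets information flow between any two positions in a constant number of layers while keeping per-row cost sublinear in n.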
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
This work presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage of an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
This work proposes Nyströmformer, a model that exhibits favorable scalability as a function of sequence length and performs favorably relative to other efficient self-attention methods.