Corpus ID: 219401747

# Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

@article{Choromanski2020MaskedLM,
title={Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers},
author={Krzysztof Choromanski and Valerii Likhosherstov and David Dohan and Xingyou Song and Jared Davis and Tam{\'a}s Sarl{\'o}s and David Belanger and Lucy J. Colwell and Adrian Weller},
journal={ArXiv},
year={2020},
volume={abs/2006.03555}
}
Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequences, such as biological sequence analysis, may fall short of meeting these assumptions, precluding…
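The abstract's "linearly scalable" attention can be illustrated with a minimal sketch, assuming a simplified positive-random-feature approximation of the softmax kernel (in the spirit of, but not identical to, the paper's mechanism; all function names here are hypothetical). The key point is associativity: once queries and keys are mapped through a feature map `phi`, attention can be computed as `phi(Q) @ (phi(K).T @ V)` in time linear in sequence length.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: O(L^2) time and memory in sequence length L."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(S - S.max(-1, keepdims=True))
    return (A / A.sum(-1, keepdims=True)) @ V

def linear_attention(Q, K, V, m=4096, seed=0):
    """Linear-time attention via positive random features (sketch).
    Uses the identity exp(q.k) = E_w[exp(w.q - |q|^2/2) exp(w.k - |k|^2/2)]
    for w ~ N(0, I), so phi(q).phi(k) is an unbiased softmax-kernel estimate."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))                 # random projections
    def phi(X):
        X = X / d ** 0.25                           # fold in the 1/sqrt(d) scaling
        return np.exp(X @ W.T - (X ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)
    Qp, Kp = phi(Q), phi(K)                         # (L, m) feature maps
    num = Qp @ (Kp.T @ V)                           # O(L m d), never forms L x L
    den = Qp @ Kp.sum(0)                            # per-row normalizer
    return num / den[:, None]
```

With enough features the estimate tracks exact softmax attention closely, while cost grows linearly rather than quadratically with sequence length.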

#### Citations of this paper

A Survey of Transformers
This survey provides a comprehensive review of various Transformer variants and proposes a new taxonomy of X-formers from three perspectives: architectural modification, pre-training, and applications.
FNet: Mixing Tokens with Fourier Transforms
• Computer Science
• ArXiv
• 2021
It is found that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up to seven times faster on GPUs and twice as fast on TPUs.
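The FNet summary above describes a drop-in replacement for self-attention; a minimal sketch of one encoder layer in that style (simplified: hypothetical helper names, plain NumPy, no dropout or learned embeddings):

```python
import numpy as np

def fourier_mixing(x):
    """FNet-style token mixing: a 2D DFT over the (sequence, hidden) axes,
    keeping only the real part. No learned parameters."""
    return np.fft.fft2(x).real

def fnet_encoder_block(x, W1, b1, W2, b2, eps=1e-6):
    """One encoder layer: Fourier mixing in place of self-attention,
    then a standard feed-forward sublayer, each with residual + layer norm."""
    def layer_norm(z):
        mu, var = z.mean(-1, keepdims=True), z.var(-1, keepdims=True)
        return (z - mu) / np.sqrt(var + eps)
    h = layer_norm(x + fourier_mixing(x))
    ff = np.maximum(0.0, h @ W1 + b1) @ W2 + b2   # ReLU MLP
    return layer_norm(h + ff)
```

Because the mixing step is a fixed transform, it costs O(L log L) via the FFT and has no attention matrix to store.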
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
• Computer Science
• ACL/IJCNLP
• 2021
This work describes an efficient hierarchical method to compute attention in the Transformer architecture that exploits a matrix structure similar to the Hierarchical Matrix developed by the numerical analysis community, and has linear run time and memory complexity.
A Trainable Optimal Transport Embedding
• Dexiong Chen
• 2021
We address the problem of learning on sets of features, motivated by the need to perform pooling operations in long biological sequences of varying sizes, with long-range dependencies, and…
A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention
• Computer Science
• ICLR
• 2021
A parametrized embedding that aggregates the features from a given set according to the optimal transport plan between the set and a trainable reference, which scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
ATTACC the Quadratic Bottleneck of Attention Layers
• Computer Science
• ArXiv
• 2021
A new attention-tailored dataflow, termed FLAT, is introduced, which leverages operator fusion, loop-nest optimizations, and interleaved execution to increase the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer, and thus achieves better run time and compute resource utilization.
An Optimized Dataflow for Mitigating Attention Performance Bottlenecks
A new attention-tailored dataflow is introduced, termed FLAT, which identifies fusion opportunities within the attention layer, and implements an on-chip memory-aware interleaved execution and tiling mechanism, which increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer.
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
• Computer Science
• ICML
• 2021
This work proposes a new way to understand self-attention networks: it is shown that their output can be decomposed into a sum of smaller terms, or paths, each involving the operation of a sequence of attention heads across layers, and proves that self-attention possesses a strong inductive bias towards "token uniformity".
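The rank-collapse claim above can be checked with a toy numeric experiment: repeatedly applying a parameter-free self-attention layer, with no skip connections or MLPs (the components the paper shows counteract the collapse), drives the token matrix toward its rank-1 "token uniformity" limit. A minimal sketch, with hypothetical helper names:

```python
import numpy as np

def pure_attention_step(X):
    """One parameter-free self-attention layer: row-softmax of X X^T
    applied to X. No residuals, no feed-forward sublayer."""
    S = X @ X.T
    S -= S.max(-1, keepdims=True)          # stabilize the softmax
    A = np.exp(S)
    A /= A.sum(-1, keepdims=True)
    return A @ X

def rank1_residual(X):
    """Frobenius distance from X to the rank-1 matrix whose rows all
    equal the mean token -- the 'uniformity' limit."""
    return np.linalg.norm(X - X.mean(0, keepdims=True))

rng = np.random.default_rng(0)
X = 0.5 * rng.standard_normal((16, 8))     # 16 tokens, hidden size 8
res = [rank1_residual(X)]
for _ in range(6):
    X = pure_attention_step(X)
    res.append(rank1_residual(X))
# res shrinks rapidly: every step averages rows under a stochastic matrix.
```

Each step replaces every token with a convex combination of all tokens, so the spread across tokens contracts layer after layer.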
Cluster-Former: Clustering-based Sparse Transformer for Question Answering
• Shuohang Wang, +5 authors Jingjing Liu
• Computer Science
• Findings
• 2021
Cluster-Former is proposed, a novel clustering-based sparse Transformer to perform attention across chunked sequences that allows information integration beyond local windows, which is especially beneficial for question answering (QA) tasks that rely on long-range dependencies.
Deformable DETR: Deformable Transformers for End-to-End Object Detection
• Computer Science
• ICLR
• 2021
Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference, can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs.

#### References

SHOWING 1-10 OF 67 REFERENCES

UniProt: a worldwide hub of protein knowledge
The UniProt Knowledgebase is a collection of sequences and annotations for over 120 million proteins across all branches of life that has greatly expanded the number of Reference Proteomes that it provides; in particular, it has focused on improving the number of viral Reference Proteomes.

Reformer: The Efficient Transformer
• Computer Science, Mathematics
• ICLR
• 2020
This work replaces dot-product attention by one that uses locality-sensitive hashing, and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.

Compiling machine learning programs via high-level tracing
• Computer Science
• 2018
JAX is described, a domain-specific tracing JIT compiler for generating high-performance accelerator code from pure Python and NumPy machine learning programs, that is capable of scaling to multi-core Cloud TPUs while remaining easily programmable and highly performant.

… Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
• 2017

Random Features for Large-Scale Kernel Machines
• Computer Science, Mathematics
• NIPS
• 2007
Two sets of random features are explored, convergence bounds are provided on their ability to approximate various radial basis kernels, and it is shown that in large-scale classification and regression tasks, linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.

Longformer: The Long-Document Transformer
• Computer Science
• ArXiv
• 2020
Following prior work on long-sequence transformers, the Longformer is evaluated on character-level language modeling and achieves state-of-the-art results on text8 and enwik8; the authors also pretrain Longformer and finetune it on a variety of downstream tasks.

ProGen: Language Modeling for Protein Generation
This work poses protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly structural annotations, and trains a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags.

Generating Long Sequences with Sparse Transformers
• Computer Science, Mathematics
• ArXiv
• 2019
This paper introduces sparse factorizations of the attention matrix which reduce this cost to $O(n\sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
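The "Random Features for Large-Scale Kernel Machines" entry above is the classical random-Fourier-feature construction that underlies kernel-based attention approximations. A minimal sketch of the Gaussian (RBF) kernel case, with a hypothetical helper name:

```python
import numpy as np

def rff_features(X, m, gamma=0.5, seed=0):
    """Random Fourier features: returns Z such that Z Z^T approximates
    the RBF kernel k(x, y) = exp(-gamma * |x - y|^2).
    Draws w ~ N(0, 2*gamma*I) (the kernel's spectral measure) and a
    uniform phase b, then maps x -> sqrt(2/m) * cos(w.x + b)."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d)) * np.sqrt(2.0 * gamma)
    b = rng.uniform(0.0, 2.0 * np.pi, m)
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)
```

A kernel machine on `Z` then reduces to a linear model in m dimensions, which is the same explicit-feature trick that linear-attention methods exploit for the softmax kernel.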
Image Transformer
This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, and significantly increases the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.