Corpus ID: 219401747

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

@article{Choromanski2020MaskedLM,
  title={Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers},
  author={Krzysztof Choromanski and Valerii Likhosherstov and David Dohan and Xingyou Song and Jared Davis and Tam{\'a}s Sarl{\'o}s and David Belanger and Lucy J. Colwell and Adrian Weller},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.03555}
}
Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequences, such as biological sequence analysis, may fall short of meeting these assumptions, precluding…
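Where the abstract breaks off, the paper's remedy is a kernel-based reformulation of attention that scales linearly in sequence length. The sketch below is a rough, assumption-laden illustration of that idea only: it approximates the softmax kernel with positive random features and computes attention without ever forming the L×L matrix. The function names, the d^(-1/4) scaling, and the normalization shown are simplifications, not the paper's exact construction.

```python
import jax
import jax.numpy as jnp

def positive_random_features(x, omega):
    # phi(x) = exp(omega x - ||x||^2 / 2) / sqrt(m); inner products of these
    # positive features approximate the softmax kernel exp(<q, k>) in expectation.
    m = omega.shape[0]
    norm = jnp.sum(x * x, axis=-1, keepdims=True) / 2.0
    return jnp.exp(x @ omega.T - norm) / jnp.sqrt(m)

def linear_attention(q, k, v, omega):
    # O(L * m * d) attention: contract keys with values first, then apply the
    # queries, and reuse the same trick for the softmax normalizer.
    scale = q.shape[-1] ** -0.25                        # assumed 1/sqrt(sqrt(d)) scaling
    qp = positive_random_features(q * scale, omega)     # (L, m)
    kp = positive_random_features(k * scale, omega)     # (L, m)
    kv = jnp.einsum("lm,ld->md", kp, v)                 # (m, d) summary of keys and values
    normalizer = qp @ kp.sum(axis=0)                    # (L,) per-query normalizer
    return (qp @ kv) / normalizer[:, None]

key = jax.random.PRNGKey(0)
L, d, m = 1024, 64, 256
q, k, v = jax.random.normal(key, (3, L, d))
omega = jax.random.normal(jax.random.PRNGKey(1), (m, d))
print(linear_attention(q, k, v, omega).shape)           # (1024, 64)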
A Survey of Transformers
This survey provides a comprehensive review of various Transformer variants and proposes a new taxonomy of X-formers from three perspectives: architectural modification, pre-training, and applications.
FNet: Mixing Tokens with Fourier Transforms
It is found that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up to seven times faster on GPUs and twice as fast on TPUs.
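As a hedged illustration of the token-mixing idea summarized above (not the authors' released code), the snippet below replaces the attention sublayer with an unparameterized 2D Fourier transform over the sequence and hidden dimensions, keeping the real part; the surrounding residuals, layer norms, and feed-forward blocks are omitted.

```python
import jax.numpy as jnp

def fourier_mixing(x):
    # Parameter-free token mixing: 2D DFT over the sequence and hidden axes,
    # keeping only the real part, used in place of self-attention.
    return jnp.real(jnp.fft.fft2(x, axes=(-2, -1)))

x = jnp.ones((8, 128, 64))        # (batch, seq_len, hidden)
print(fourier_mixing(x).shape)    # (8, 128, 64)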
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
This work describes an efficient hierarchical method to compute attention in the Transformer architecture that exploits a matrix structure similar to the Hierarchical Matrix developed by the numerical analysis community, and has linear run time and memory complexity.
A Trainable Optimal Transport Embedding
We address the problem of learning on sets of features, motivated by the need to perform pooling operations in long biological sequences of varying sizes, with long-range dependencies, and…
A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention
A parametrized embedding that aggregates the features from a given set according to the optimal transport plan between the set and a trainable reference is proposed; it scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
ATTACC the Quadratic Bottleneck of Attention Layers
A new attention-tailored dataflow, termed FLAT, is introduced, which leverages operator fusion, loop-nest optimizations, and interleaved execution to increase the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer, and thus achieves better run time and compute resource utilization.
An Optimized Dataflow for Mitigating Attention Performance Bottlenecks
A new attention-tailored dataflow, termed FLAT, is introduced, which identifies fusion opportunities within the attention layer and implements an on-chip memory-aware interleaved execution and tiling mechanism, increasing the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer.
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
This work proposes a new way to understand self-attention networks: it is shown that their output can be decomposed into a sum of smaller terms, or paths, each involving the operation of a sequence of attention heads across layers, and it is proved that self-attention possesses a strong inductive bias towards “token uniformity”.
Cluster-Former: Clustering-based Sparse Transformer for Question Answering
Cluster-Former is proposed, a novel clustering-based sparse Transformer that performs attention across chunked sequences, allowing information integration beyond local windows, which is especially beneficial for question answering (QA) tasks that rely on long-range dependencies.
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference, can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs.

References

Showing 1–10 of 67 references
UniProt: a worldwide hub of protein knowledge
The UniProt Knowledgebase is a collection of sequences and annotations for over 120 million proteins across all branches of life; it has greatly expanded the number of Reference Proteomes it provides, and in particular has focused on improving the number of viral Reference Proteomes.
Reformer: The Efficient Transformer
This work replaces dot-product attention with one that uses locality-sensitive hashing, and uses reversible residual layers instead of the standard residuals, which allows storing activations only once during training instead of several times, making the model much more memory-efficient and much faster on long sequences.
Compiling machine learning programs via high-level tracing
JAX is described: a domain-specific tracing JIT compiler for generating high-performance accelerator code from pure Python and NumPy machine learning programs, capable of scaling to multi-core Cloud TPUs while remaining an easily programmable and highly performant ML system.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
Random Features for Large-Scale Kernel Machines
Two sets of random features are explored, convergence bounds on their ability to approximate various radial basis kernels are provided, and it is shown that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.
Longformer: The Long-Document Transformer
Following prior work on long-sequence transformers, the Longformer is evaluated on character-level language modeling, achieving state-of-the-art results on text8 and enwik8; it is also pretrained and finetuned on a variety of downstream tasks.
ProGen: Language Modeling for Protein Generation
This work poses protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly structural annotations, and trains a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags.
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce this cost to $O(n\sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
Image Transformer
This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, and significantly increases the size of images the model can process in practice, while maintaining significantly larger receptive fields per layer than typical convolutional neural networks.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.