BERTology Meets Biology: Interpreting Attention in Protein Language Models

@article{Vig2021BERTologyMB,
  title={BERTology Meets Biology: Interpreting Attention in Protein Language Models},
  author={Jesse Vig and Ali Madani and Lav R. Varshney and Caiming Xiong and Richard Socher and Nazneen Rajani},
  journal={bioRxiv},
  year={2021}
}
Transformer architectures have been shown to learn useful representations for protein classification and generation tasks. However, these representations present challenges in interpretability. Through the lens of attention, we analyze the inner workings of the Transformer and explore how the model discerns structural and functional properties of proteins. We show that attention (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but…
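As a rough, hedged illustration of this kind of attention analysis, the sketch below extracts per-head attention maps from a publicly available protein BERT checkpoint via the Hugging Face transformers library. The checkpoint name Rostlab/prot_bert and the toy sequence are illustrative assumptions, not necessarily the exact models or data analyzed in the paper.

# Minimal sketch: extract residue-residue attention maps from a protein language model.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert", output_attentions=True)
model.eval()

sequence = "M K T A Y I A K Q R"  # ProtBert-style input: space-separated amino acids (toy example)
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each of shape (batch, heads, seq_len, seq_len)
attn = torch.stack(outputs.attentions)        # (layers, batch, heads, L, L)
# Average over layers and heads to obtain a single residue-residue attention map
mean_attn = attn.mean(dim=(0, 2)).squeeze(0)  # (L, L), still includes [CLS]/[SEP] positions
print(mean_attn.shape)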
Visualizing Transformers for NLP: A Brief Survey
TLDR
A survey on explaining Transformer architectures through visualizations, which examines the various Transformer facets that can be explored through visual analytics and proposes a set of requirements for future Transformer visualization frameworks.
Transformer protein language models are unsupervised structure learners
TLDR
The highest capacity models that have been trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.
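A hedged sketch of how such a single-forward-pass predictor could be scored: treat an aggregated L x L attention map as contact scores and measure the precision of the top-L long-range pairs against a known contact map. The arrays below are random placeholders standing in for real model output and real structures.

# Sketch: precision-at-L of an attention map used as an unsupervised contact predictor.
import numpy as np

def precision_at_L(attention: np.ndarray, contacts: np.ndarray, min_sep: int = 6) -> float:
    L = attention.shape[0]
    sym = (attention + attention.T) / 2       # symmetrize the attention map
    i, j = np.triu_indices(L, k=min_sep)      # candidate pairs above a minimum sequence separation
    order = np.argsort(sym[i, j])[::-1][:L]   # top-L highest-scoring pairs
    return float(contacts[i[order], j[order]].mean())

rng = np.random.default_rng(0)
attn = rng.random((50, 50))                   # placeholder attention map
true_contacts = rng.random((50, 50)) < 0.05   # placeholder boolean contact map
print(precision_at_L(attn, true_contacts))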
Is Transfer Learning Necessary for Protein Landscape Prediction?
TLDR
It is shown that CNN models trained solely using supervised learning both compete with and sometimes outperform the best models from TAPE that leverage expensive pretraining on large protein datasets.
Embeddings from deep learning transfer GO annotations beyond homology
TLDR
This work proposes predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space, and suggests this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.
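The annotation-transfer idea can be sketched in a few lines: each query protein inherits the GO terms of its cosine-nearest neighbour in a fixed-length embedding space. The embeddings and GO sets below are random placeholders, not actual SeqVec output.

# Sketch: nearest-neighbour GO annotation transfer in embedding space.
import numpy as np

def transfer_annotations(query_emb, ref_emb, ref_go_terms):
    """Assign each query the GO terms of its cosine-nearest reference protein."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    nearest = (q @ r.T).argmax(axis=1)        # index of the most similar reference protein
    return [ref_go_terms[i] for i in nearest]

rng = np.random.default_rng(1)
ref = rng.normal(size=(100, 1024))            # 100 reference proteins, 1024-d placeholder embeddings
queries = rng.normal(size=(5, 1024))
go = [{"GO:0003824"}] * 100                   # placeholder GO term sets
print(transfer_annotations(queries, ref, go))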
EGRET: Edge Aggregated Graph Attention Networks and Transfer Learning Improve Protein-Protein Interaction Site Prediction
Motivation: Protein-protein interactions are central to most biological processes. However, reliable identification of protein-protein interaction (PPI) sites using conventional experimental methods…
Fixed-Length Protein Embeddings using Contextual Lenses
TLDR
This work considers Transformer (BERT) protein language models pretrained on the TrEMBL data set and learns fixed-length embeddings on top of them with contextual lenses, showing that for nearest-neighbor family classification, pretraining offers a noticeable boost in performance and that the corresponding learned embeddings are competitive with BLAST.
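As a simplified stand-in for a learned contextual lens, the sketch below mean-pools per-residue embeddings into a fixed-length vector; the cited work learns the pooling on top of a pretrained Transformer, so this is only an assumed minimal variant with made-up shapes.

# Sketch: reduce variable-length per-residue embeddings to a fixed-length protein embedding.
import torch

def mean_pool_lens(residue_embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """residue_embeddings: (batch, L, d); mask: (batch, L), 1 for real residues, 0 for padding."""
    mask = mask.unsqueeze(-1).float()
    summed = (residue_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts                    # (batch, d) fixed-length embedding

emb = torch.randn(2, 7, 1024)                 # two toy proteins, 7 residues, 1024-d embeddings
mask = torch.ones(2, 7)
print(mean_pool_lens(emb, mask).shape)        # torch.Size([2, 1024])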
Transformers with Competitive Ensembles of Independent Mechanisms
TLDR
This work proposes Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention, and proposes a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent.
Language Models are Open Knowledge Graphs
TLDR
This paper shows how to construct knowledge graphs (KGs) from pre-trained language models (e.g., BERT, GPT-2/3), without human supervision, and proposes an unsupervised method to cast the knowledge contained within language models into KGs.
BERTMHC: Improves MHC-peptide class II interaction prediction with transformer and multiple instance learning
TLDR
A transformer neural network model which leverages self-supervised pretraining from a large corpus of protein sequences outperforms state-of-the-art models for both binding and mass spectrometry presentation predictions.

References

SHOWING 1-10 OF 115 REFERENCES
Generative Models for Graph-Based Protein Design
TLDR
This framework significantly improves in both speed and robustness over conventional and deep-learning-based methods for structure-based protein sequence design, and takes a step toward rapid and targeted biomolecular design with the aid of deep generative models.
Learning protein sequence embeddings using information from structure
TLDR
A framework that maps any protein sequence to a sequence of vector embeddings --- one per amino acid position --- that encode structural information, outperforming other sequence-based methods and even a top-performing structure-based alignment method when predicting structural similarity.
Unified rational protein engineering with sequence-only deep representation learning
TLDR
This work applies deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded.
Accelerating Protein Design Using Autoregressive Generative Models
TLDR
This work borrows from recent advances in natural language processing and speech synthesis to develop a generative deep neural network-powered autoregressive model for biological sequences that captures functional constraints without relying on an explicit alignment structure.
exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models
TLDR
ExBERT provides insights into the meaning of the contextual representations and attention by matching a human-specified input to similar contexts in large annotated datasets, and can quickly replicate findings from the literature and extend them to previously unanalyzed models.
ProGen: Language Modeling for Protein Generation
TLDR
This work poses protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations and trains a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags.
Evaluating Protein Transfer Learning with TAPE
TLDR
It is found that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases and suggesting a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences.
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
TLDR
This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
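Assuming the authors' released fair-esm package, per-residue representations and unsupervised contact predictions from such a model can be obtained roughly as follows; the checkpoint name and example sequence are illustrative, not prescriptive.

# Sketch: per-residue embeddings and contact predictions from a pretrained ESM model.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # toy sequence
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33], return_contacts=True)

per_residue = results["representations"][33]  # (batch, length, embed_dim) residue embeddings
contacts = results["contacts"]                # (batch, length, length) contact probabilities
print(per_residue.shape, contacts.shape)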
ProteinNet: a standardized data set for machine learning of protein structure
TLDR
The ProteinNet series of data sets were created to provide a standardized mechanism for training and assessing data-driven models of protein sequence-structure relationships and to create validation sets distinct from the official CASP sets that faithfully mimic their difficulty.
NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning
TLDR
An updated and extended version of the tool that can predict the most important local structural features with unprecedented accuracy and run-time is presented, and the processing time has been optimized to allow predicting more than 1,000 proteins in less than 2 hours, and complete proteomes in less than 1 day.