Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks

Ananthan Nambiar, Simon Liu, Mark Hopkins, Maeve Heflin, Sergei Maslov, and Anna M. Ritz
The scientific community is rapidly generating protein sequence information, but only a fraction of these proteins can be experimentally characterized. While promising deep learning approaches for protein prediction tasks have emerged, they have computational limitations or are designed to solve a specific task. We present a Transformer neural network that pre-trains task-agnostic sequence representations. This model is fine-tuned to solve two different protein prediction tasks: protein family… 

Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks

This paper proposes a transformer neural network that attends to both sequence and tertiary structure, and shows that such joint representations are more powerful than sequence-based representations only, and they yield better performance on superfamily membership across various metrics.

Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models

This work introduces a new pre-training task, directly predicting protein profiles derived from multiple sequence alignments, which outperforms masked language modeling alone on all five tasks and suggests that protein sequence models may benefit from leveraging biologically inspired inductive biases that go beyond existing language modeling techniques in NLP.
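The pre-training target described above is, at its core, a per-position amino-acid distribution computed from a multiple sequence alignment. A minimal sketch of such a profile, assuming a toy gapped MSA (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def msa_profile(alignment):
    """Per-column amino-acid frequency profile (L x 20) from an MSA.

    Gap characters ('-') are skipped when counting column frequencies.
    """
    length = len(alignment[0])
    profile = np.zeros((length, len(AMINO_ACIDS)))
    for seq in alignment:
        for pos, aa in enumerate(seq):
            if aa in AA_INDEX:
                profile[pos, AA_INDEX[aa]] += 1
    totals = profile.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1  # avoid division by zero for all-gap columns
    return profile / totals

# Tiny illustrative alignment of three sequences
msa = ["ACD-", "ACD-", "AGDE"]
p = msa_profile(msa)
```

A model pre-trained on this task would take the raw sequence as input and regress (or cross-entropy match) each row of `p`.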

Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval

Tranception is introduced, a novel transformer architecture that leverages autoregressive predictions and retrieval of homologous sequences at inference time to achieve state-of-the-art protein fitness prediction performance. The authors also develop ProteinGym, an extensive set of multiplexed assays of variant effects that substantially increases both the number and diversity of assays compared to existing benchmarks.

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

The results imply that protein LMs learned some of the grammar of the language of life: the transfer of the most informative embeddings outperformed, for the first time, the state-of-the-art without using evolutionary information, thereby bypassing expensive database searches.

Protein Interaction Sites Prediction using Deep Learning

While this work improves on DELPHI, the currently best-performing program for binding site prediction, it is notable that some of the best machine learning techniques failed to provide the expected improvement, a result that will require further investigation.

Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks

This work presents a modification to the RoBERTa model: during pre-training, the input is a mixture of binding and non-binding protein sequence pairs (from the STRING database). The sequence pairs carry no label indicating their binding status, as the model relies solely on the masked language modeling (MLM) objective during pre-training.
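The label-agnostic setup above reduces to standard MLM over concatenated sequence pairs. A minimal sketch of the masking step, assuming a hypothetical `#` mask token and `|` separator (RoBERTa uses subword token IDs and special tokens instead):

```python
import numpy as np

MASK = "#"  # hypothetical single-character stand-in for the [MASK] token

def mask_sequence_pair(seq_a, seq_b, mask_rate=0.15, seed=0):
    """Concatenate two protein sequences with a separator and mask a
    fraction of residue positions for MLM pre-training.

    Returns the masked string and a {position: original residue} dict of
    prediction targets. The separator token '|' is never masked.
    """
    rng = np.random.default_rng(seed)
    paired = list(seq_a + "|" + seq_b)
    candidates = [i for i, tok in enumerate(paired) if tok != "|"]
    n_mask = max(1, int(round(mask_rate * len(candidates))))
    chosen = rng.choice(candidates, size=n_mask, replace=False)
    targets = {int(i): paired[int(i)] for i in chosen}
    for i in chosen:
        paired[int(i)] = MASK
    return "".join(paired), targets

masked, targets = mask_sequence_pair("MKTAYIAK", "GSHMLEDP")
```

The model never sees whether the pair binds; any binding signal must be absorbed implicitly while reconstructing the masked residues.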

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

For secondary structure prediction, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments or evolutionary information, thereby bypassing expensive database searches.

Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention

This work introduces an energy-based attention layer, factored attention, which, in a certain limit, recovers a Potts model, and uses it to contrast Potts and Transformers.
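A Potts model scores a sequence with single-site fields and pairwise couplings; the factored attention layer, in the paper's limit, recovers couplings of exactly this form. A toy sketch of the Potts energy under standard conventions (shapes and names are illustrative):

```python
import numpy as np

def potts_energy(seq_idx, h, J):
    """Energy of an integer-encoded sequence under a Potts model:
    E(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j]

    seq_idx: length-L sequence of state indices
    h: (L, q) single-site fields
    J: (L, L, q, q) pairwise couplings (only i < j entries are used)
    """
    L = len(seq_idx)
    energy = -sum(h[i, seq_idx[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            energy -= J[i, j, seq_idx[i], seq_idx[j]]
    return energy

# Tiny example: L = 2 positions, q = 2 states; favour the pair (0, 0)
h = np.zeros((2, 2))
J = np.zeros((2, 2, 2, 2))
J[0, 1, 0, 0] = 1.0
```

Lower energy means a more favourable configuration, so the coupled pair (0, 0) scores below any other assignment here.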

Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model

The results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% over the MLM baseline under the same setting, revealing the potential of sequence pre-training methods to surpass MSA-based methods in general.
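The key difference from vanilla MLM is that a *pair* of positions is masked and predicted jointly, forcing the model to learn inter-residue dependencies. A minimal sketch of the pairwise masking step, with a hypothetical `#` mask token (the actual method operates on token IDs and a joint prediction head):

```python
import numpy as np

def pairwise_mask(seq, seed=0):
    """Mask a randomly chosen pair of positions jointly, so a model must
    predict the two residues together (pairwise-MLM sketch)."""
    rng = np.random.default_rng(seed)
    i, j = rng.choice(len(seq), size=2, replace=False)
    i, j = sorted((int(i), int(j)))
    masked = list(seq)
    target = (masked[i], masked[j])  # joint prediction target
    masked[i] = masked[j] = "#"
    return "".join(masked), (i, j), target

masked, (i, j), target = pairwise_mask("MKTAYIAK")
```

Predicting the joint distribution over the masked pair, rather than two independent marginals, is what lets the objective expose co-evolutionary signal.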

A deep learning framework for improving protein interaction prediction using sequence properties

A deep learning-based framework, named iPPI, accurately predicts PPIs on a proteome-wide scale using only sequence information, and greatly outperforms state-of-the-art prediction methods in identifying PPIs.

UDSMProt: universal deep sequence models for protein classification

A universal deep sequence model is pretrained on unlabeled protein sequences from Swiss-Prot and fine-tuned on protein classification tasks; it performs on par with state-of-the-art algorithms tailored to these specific tasks and, for two out of three tasks, even outperforms them.

Modeling the language of life – Deep Learning Protein Sequences

This work introduces a novel way to represent protein sequences as continuous vectors (embeddings) by using the deep bi-directional model ELMo taken from natural language processing (NLP) and shows that transfer learning can be used to capture biochemical or biophysical properties of protein sequences from large unlabeled sequence databases.

Multifaceted protein–protein interaction prediction based on Siamese residual RCNN

An end-to-end framework, PIPR (Protein–Protein Interaction Prediction Based on Siamese Residual RCNN), predicts PPIs using only the protein sequences; it leverages both robust local features and contextualized information, which are significant for capturing the mutual influence of protein sequences.
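The Siamese idea is that one shared encoder embeds both sequences and a symmetric combination scores the pair, so the prediction does not depend on input order. A minimal sketch with a toy composition encoder standing in for PIPR's residual RCNN (names and the scoring rule are illustrative):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(seq):
    """Toy shared 'encoder': normalized amino-acid composition vector.
    PIPR uses a residual RCNN here; composition is only a stand-in."""
    vec = np.zeros(len(AMINO_ACIDS))
    for aa in seq:
        idx = AMINO_ACIDS.find(aa)
        if idx >= 0:
            vec[idx] += 1
    return vec / max(len(seq), 1)

def interaction_score(seq_a, seq_b):
    """Siamese scoring: the same encoder embeds both inputs, and a
    symmetric combination (dot product) yields an order-invariant score."""
    return float(np.dot(encode(seq_a), encode(seq_b)))

s_same = interaction_score("MKTA", "MKTA")
s_diff = interaction_score("MKTA", "GGGG")
```

Because the combination is symmetric, swapping the two proteins cannot change the predicted score, a property the real architecture enforces by design.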

Sequence-based prediction of protein protein interaction using a deep-learning algorithm

This research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.

Using Deep Learning to Annotate the Protein Universe

This paper explores an alternative deep learning methodology that learns the relationship between unaligned amino acid sequences and their functional annotations across all 17,929 families of the Pfam database, and reports convolutional networks that are significantly more accurate and computationally efficient than BLASTp.

Predicting protein‐protein interactions through sequence‐based deep learning

A novel deep learning framework, DPPI, is presented, which efficiently applies a deep, Siamese‐like convolutional neural network combined with random projection and data augmentation to predict PPIs, leveraging existing high‐quality experimental PPI data and evolutionary information of a protein pair under prediction.

DeepFam: deep learning based alignment-free method for protein family modeling and prediction

DeepFam is introduced, an alignment-free method that extracts functional information directly from sequences without the need for multiple sequence alignments; while predicting protein functions, it was able to detect conserved regions documented in the PROSITE database.
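The alignment-free detection of conserved regions rests on 1-D convolutional filters sliding over a one-hot encoded sequence: a filter that matches a motif produces a sharp peak at its location. A minimal sketch with a hand-built one-hot motif filter (the real model learns its filters from data):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """(L, 20) one-hot encoding of a protein sequence."""
    mat = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        mat[i, AMINO_ACIDS.index(aa)] = 1.0
    return mat

def motif_scores(seq, motif):
    """Slide a one-hot motif filter along the sequence (valid 1-D
    convolution); the peak marks the best motif match, no alignment needed."""
    x = one_hot(seq)
    w = one_hot(motif)
    k = len(motif)
    return np.array([np.sum(x[i:i + k] * w) for i in range(len(seq) - k + 1)])

scores = motif_scores("AAGDSGGA", "GDS")
```

A max-pooling layer over such score tracks is what lets the network report "motif present somewhere" regardless of position.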

Evolutionary profiles improve protein-protein interaction prediction from sequence

A new approach to predict PPIs from sequence alone which is based on evolutionary profiles and profile-kernel support vector machines improves over the state-of-the-art, in particular for proteins that are sequence-dissimilar to proteins with known interaction partners.

Evolutionary context-integrated deep sequence modeling for protein engineering

Protein engineering seeks to design proteins with improved or novel functions. Compared to rational design and directed evolution approaches, machine learning-guided approaches traverse the fitness…