Pre-Training of Deep Bidirectional Protein Sequence Representations With Structural Information

Seonwoo Min, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, and Sungroh Yoon. IEEE Access.
Bridging the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies adopted semi-supervised learning for protein sequence modeling. In these studies, models were pre-trained with a substantial amount of unlabeled data, and the representations were transferred to various downstream tasks. Most pre-training methods solely rely on language modeling and often exhibit limited performance. In this paper, we introduce a novel pre-training scheme… 

Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model

The result shows that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the MLM baseline under the same setting, revealing the potential of the sequence pre-training method to surpass MSA based methods in general.

Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks

This work modifies RoBERTa by feeding it, during pre-training, a mixture of binding and non-binding protein sequence pairs (from the STRING database); however, the sequence pairs carry no label indicating their binding status, as the model relies solely on the Masked Language Modeling (MLM) objective during pre-training.
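
As a rough sketch of the kind of label-agnostic setup this describes (not the paper's actual implementation; token names and the 80/10/10 split follow BERT conventions and are assumptions here), MLM masking over a concatenated sequence pair might look like:

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK, SEP = "<mask>", "<sep>"

def mask_paired_sequence(seq_a, seq_b, mask_prob=0.15, seed=0):
    """BERT-style masking over a concatenated protein pair.

    Each residue is selected with probability `mask_prob`; selected
    residues are replaced by <mask> 80% of the time, by a random
    amino acid 10% of the time, and kept unchanged 10% of the time.
    Returns (input_tokens, labels), where labels are -1 at positions
    the MLM loss should ignore. Note the pair carries no binding label.
    """
    rng = random.Random(seed)
    tokens = list(seq_a) + [SEP] + list(seq_b)
    labels = [-1] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok == SEP or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # predict the original residue here
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK
        elif r < 0.9:
            tokens[i] = rng.choice(AMINO_ACIDS)
        # else: keep the original token unchanged
    return tokens, labels
```

The model then only ever sees the masked pair; whether the two sequences actually bind is never encoded in the input or the loss.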

ProteinGLUE multi-task benchmark suite for self-supervised protein modeling

It is shown that pre-training yields higher performance on a variety of downstream tasks, such as secondary structure and protein interaction interface prediction, compared to no pre-training, and that the larger base model does not outperform the smaller medium-sized model.

Rethinking Relational Encoding in Language Model: Pre-Training for General Sequences

It is posited that while LMPT can effectively model per-token relations, it fails at modeling per-sequence relations in non-natural-language domains, and a framework is developed that couples LMPT with deep structure-preserving metric learning to produce richer embeddings than can be obtained from LMPT alone.

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments or evolutionary information thereby bypassing expensive database searches.

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

The results implied that protein LMs learned some of the grammar of the language of life, as the transfer of the most informative embeddings for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches.

ProtPlat: an efficient pre-training platform for protein classification based on FastText

A pre-training platform for representing protein sequences, called ProtPlat, is proposed, which uses the Pfam database to train a three-layer neural network, and then uses specific training data from downstream tasks to fine-tune the model.

Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction

A multimodal protein pre-training model with three modalities: sequence, structure, and function (S2F), which incorporates the knowledge from the functional description of proteins extracted from literature or manual annotations for PPIs.

Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets

A variety of protein LMs are trained that are likely to illuminate different angles of the protein language, and simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations.

Pretraining model for biological sequence data

A broad review of popular pre-training models for biological sequences, organized into four categories (CNN, word2vec, LSTM, and Transformer), together with a novel pre-training scheme for protein sequences and a multi-task benchmark for protein pre-training models.

Modeling aspects of the language of life through transfer-learning protein sequences

Transfer-learning succeeded in extracting information from unlabeled sequence databases relevant to various protein prediction tasks, and modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods.

Evaluating Protein Transfer Learning with TAPE

It is found that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases and suggesting a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences.

Learning protein sequence embeddings using information from structure

A framework that maps any protein sequence to a sequence of vector embeddings, one per amino-acid position, that encode structural information; it outperforms other sequence-based methods, and even a top-performing structure-based alignment method, at predicting structural similarity.

Self-Supervised Contrastive Learning of Protein Representations By Mutual Information Maximization

This work applies the principle of mutual information maximization between local and global information as a self-supervised pre-training signal for protein embeddings: protein sequences are divided into fixed-size fragments, and an autoregressive model is trained to distinguish subsequent fragments from the same protein from fragments drawn from random proteins.
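
A minimal sketch of the data preparation this describes (fragment sizes and pairing scheme are illustrative assumptions, not the paper's exact setup): positive pairs are consecutive fragments of one protein, negatives pair a fragment with one from a different protein.

```python
import random

def fragment(seq, size):
    """Split a sequence into non-overlapping fixed-size fragments,
    dropping any trailing remainder shorter than `size`."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]

def contrastive_pairs(proteins, size=4, seed=0):
    """Build (anchor, candidate, label) triples: label 1 when the
    candidate immediately follows the anchor in the same protein,
    label 0 when it is drawn from a different, randomly chosen protein."""
    rng = random.Random(seed)
    pairs = []
    for idx, seq in enumerate(proteins):
        frags = fragment(seq, size)
        others = [p for j, p in enumerate(proteins) if j != idx]
        for anchor, nxt in zip(frags, frags[1:]):
            pairs.append((anchor, nxt, 1))  # positive: subsequent fragment
            neg_src = fragment(rng.choice(others), size)
            if neg_src:
                pairs.append((anchor, rng.choice(neg_src), 0))  # negative
    return pairs
```

A discriminator trained on such triples receives the mutual-information-style signal: fragments of the same protein share global context that random fragments lack.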

Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks

This paper proposes an end-to-end deep network that predicts protein secondary structures from integrated local and global contextual features and leverages convolutional neural networks with different kernel sizes to extract multiscale local contextual features.
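
To make "different kernel sizes" concrete, here is a dependency-free sketch (an illustration, not the paper's network) of extracting local context windows at several scales, analogous to what same-padded convolutions with kernel sizes 3, 5, and 7 would see at each residue:

```python
def multiscale_windows(seq, kernel_sizes=(3, 5, 7)):
    """Return, for each kernel size k, one length-k context window per
    residue position, padding both ends with 'X' (unknown residue) so
    every position gets a window, mimicking 'same' convolution padding."""
    out = {}
    for k in kernel_sizes:
        pad = "X" * (k // 2)
        padded = pad + seq + pad
        out[k] = [padded[i:i + k] for i in range(len(seq))]
    return out
```

Each convolutional branch in such a network summarizes these windows into features; concatenating the branches yields the multiscale local representation that the recurrent layers then integrate globally.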

UDSMProt: universal deep sequence models for protein classification

A universal deep sequence model that is pretrained on unlabeled protein sequences from Swiss-Prot and finetuned on protein classification tasks and performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them.

Unified rational protein engineering with sequence-based deep representation learning

Deep learning is applied to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded and broadly applicable to unseen regions of sequence space.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
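
One of ALBERT's two techniques, factorized embedding parameterization, is easy to see as arithmetic: a small embedding of size `factor` is projected up to the hidden size, so the table cost drops from `vocab * hidden` to `vocab * factor + factor * hidden` (the concrete numbers below use BERT-base-like sizes as an illustration).

```python
def embedding_params(vocab, hidden, factor=None):
    """Parameter count of a token-embedding table.

    Without factorization: a single vocab x hidden matrix.
    With ALBERT-style factorization: a vocab x factor table plus a
    factor x hidden projection matrix.
    """
    if factor is None:
        return vocab * hidden
    return vocab * factor + factor * hidden
```

With a 30k vocabulary and hidden size 768, factorizing through 128 dimensions shrinks the embedding parameters from about 23M to under 4M, which is where much of ALBERT's memory saving comes from.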

ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing

For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches.

Selfie: Self-supervised Pretraining for Image Embedding

The pre-training technique, called Selfie (SELF-supervised Image Embedding), generalizes BERT's masked language modeling to continuous data such as images by making use of the Contrastive Predictive Coding loss.