Controllable protein design with language models

@article{ferruz_controllable,
  title={Controllable protein design with language models},
  author={Noelia Ferruz and Birte H{\"o}cker},
  journal={Nat. Mach. Intell.}
}
The 21st century is presenting humankind with unprecedented environmental and medical challenges. The ability to design novel proteins tailored for specific purposes could transform our ability to respond to these issues in a timely manner. Recent advances in the field of artificial intelligence are now setting the stage to make this goal achievable. Protein sequences are inherently similar to natural languages: amino acids arrange in a multitude of combinations to form structures that carry function, the…

ProtGPT2 is a deep unsupervised language model for protein design

ProtGPT2 is a language model trained on protein space that generates de novo protein sequences following the principles of natural ones, and has the potential to produce de novo proteins in a high-throughput fashion in a matter of seconds.
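The autoregressive decoding that such a model performs can be illustrated with a toy sketch. The `toy_next_residue_probs` function below is a hypothetical stand-in for a trained language model (it is not ProtGPT2): it assigns a probability to each next amino acid given the sequence so far, and sampling proceeds one residue at a time.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def toy_next_residue_probs(prefix):
    """Hypothetical stand-in for a trained protein language model:
    returns a probability for each amino acid given the sequence so far.
    Here we merely bias toward repeating the last residue; a real pLM
    learns these conditional distributions from millions of sequences."""
    probs = {aa: 1.0 for aa in AMINO_ACIDS}
    if prefix:
        probs[prefix[-1]] += 2.0  # toy bias, purely illustrative
    total = sum(probs.values())
    return {aa: p / total for aa, p in probs.items()}

def generate(length, seed=0):
    """Sample a sequence one residue at a time (autoregressive decoding)."""
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        probs = toy_next_residue_probs(seq)
        seq += rng.choices(list(probs), weights=probs.values())[0]
    return seq
```

Because each residue costs only one forward pass of the model, generation at this granularity is what makes high-throughput sequence design feasible.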

From sequence to function through structure: deep learning for protein design

This work documents recent advances in deep-learning-assisted protein design over the last three years, presents a practical pipeline that goes from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and leverages it to suggest a generated protein sequence that might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound.

Protein Language Models and Structure Prediction: Connection and Progression

This work introduces the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases, and discusses the types of methods for protein structure prediction (PSP), in particular how pLM-based architectures function in the process of protein folding.

Designing novel protein structures using sequence generator and AlphaFold2

This work develops a novel protein design pipeline consisting of two deep learning algorithms, ProteinSolver and AlphaFold2, that generates amino acid sequences such that the forces between interacting amino acids are favorable and compatible with the fold.

Accurate and efficient protein sequence design through learning concise local environment of residues

ProDESIGN-LE is presented, an accurate and efficient design approach, which adopts a concise but informative representation of residue’s local environment and trains a transformer to select an appropriate residue at a position from its local environment.

Deep Learning Concepts and Applications for Synthetic Biology

This review presents an overview of synthetic biology-relevant classes of data and deep learning architectures and highlights emerging studies in synthetic biology that capitalize on deep learning to enable novel understanding and design, and discusses challenges and future opportunities.

Machine learning can guide experimental approaches for protein digestibility estimations

This study proposes a machine learning approach to predict the true ileal digestibility of food items with an accuracy of 90% compared to existing experimental techniques, making the process of creating new foods faster, cheaper, and more ethical.

Composition based oxidation state prediction of materials using deep learning

This work proposes BERTOS, a novel deep-learning-based BERT transformer language model for predicting the oxidation states of all elements of inorganic compounds given only their chemical composition, and achieves 96.82% accuracy for all-element oxidation-state prediction benchmarked on the cleaned ICSD dataset.

Transformer neural network for protein-specific de novo drug generation as a machine translation problem

This work applies Transformer neural network architecture, a state-of-the-art approach in sequence transduction tasks, to generate novel molecules with predicted ability to bind a target protein by relying on its amino acid sequence only, and generates realistic diverse compounds with structural novelty.

Unified rational protein engineering with sequence-based deep representation learning

Deep learning is applied to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich; structurally, evolutionarily and biophysically grounded; and broadly applicable to unseen regions of sequence space.

Using deep learning to annotate the protein universe.

This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation, suggesting that deep learning models will be a core component of future protein annotation tools.

Grammar of protein domain architectures

This work employs a popular linguistic technique, n-gram analysis, to probe the "proteome grammar", that is, the rules of association of domains that generate the various domain architectures of proteins, and concludes that there exists a "quasi-universal grammar" of protein domains.
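The core of such an n-gram analysis can be sketched in a few lines: treat each protein's domain architecture as a sentence of domain "words" and count ordered domain pairs (bigrams). The architectures below are hypothetical examples, not data from the paper.

```python
from collections import Counter

# Hypothetical domain architectures: each protein is a list of domain names.
architectures = [
    ["SH3", "SH2", "Kinase"],
    ["SH2", "Kinase"],
    ["PDZ", "SH3", "SH2", "Kinase"],
]

def domain_bigrams(archs):
    """Count ordered adjacent domain pairs (bigrams) across all proteins."""
    counts = Counter()
    for arch in archs:
        for a, b in zip(arch, arch[1:]):
            counts[(a, b)] += 1
    return counts

bigrams = domain_bigrams(architectures)
```

Frequent bigrams such as ("SH2", "Kinase") in this toy set play the role of recurring grammatical constructions; comparing their frequencies against a shuffled null model is what reveals non-random rules of domain association.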

Improved protein structure prediction using potentials from deep learning

It is shown that a neural network can be trained to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions, and the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures.
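The "potential plus gradient descent" idea can be illustrated with a one-dimensional toy sketch (this is not the paper's actual potential): given predicted distances between neighboring residues, start from arbitrary coordinates and follow the gradient of a squared-error potential until the realized distances match the predictions.

```python
def refine_coordinates(n_points, target_gaps, steps=500, lr=0.05):
    """Toy 1-D analogue of structure generation from a distance potential:
    minimize sum_i ((x[i+1] - x[i]) - target_gaps[i])**2 by plain
    gradient descent, starting from arbitrary coordinates."""
    x = [float(i) for i in range(n_points)]  # arbitrary initialization
    for _ in range(steps):
        grad = [0.0] * n_points
        for i, t in enumerate(target_gaps):
            err = (x[i + 1] - x[i]) - t  # deviation from predicted distance
            grad[i] -= 2 * err
            grad[i + 1] += 2 * err
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x

coords = refine_coordinates(3, [1.5, 2.0])
```

Because the potential is smooth in the coordinates, simple gradient descent suffices here, which is the point the abstract makes about avoiding complex sampling procedures.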

De novo protein design by deep network hallucination

Deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute alongside traditional physics-based models to the de novo design of proteins with new functions.

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, and finds that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity.

Advances in protein structure prediction and design

Improvements in computational algorithms and technological advances have dramatically increased the accuracy and speed of protein structure modelling, providing novel opportunities for controlling protein function, with potential applications in biomedicine, industry and research.

Learned protein embeddings for machine learning

The predictive power of Gaussian process models trained using embeddings is comparable to those trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions.
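A minimal sketch of this setup, assuming hypothetical 2-D embeddings and made-up property values (real learned embeddings have tens to hundreds of dimensions): compute a Gaussian-process posterior mean over a kernel on the embedding vectors, here with a 2x2 kernel matrix inverted by hand.

```python
import math

def rbf(u, v, gamma=1.0):
    """Squared-exponential kernel on embedding vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embeddings of two proteins with a measured scalar property.
train = [([0.0, 0.0], 0.2), ([1.0, 1.0], 0.9)]
query = [0.9, 1.1]  # embedding of an unmeasured protein

# GP posterior mean: mean = k_* . K^{-1} y, with the 2x2 K inverted manually.
k12 = rbf(train[0][0], train[1][0])
det = 1.0 - k12 * k12
y1, y2 = train[0][1], train[1][1]
alpha = [(y1 - k12 * y2) / det, (y2 - k12 * y1) / det]  # K^{-1} y
k_star = [rbf(query, x) for x, _ in train]
mean = sum(k * a for k, a in zip(k_star, alpha))
```

The query embedding sits close to the second training point, so the posterior mean lands near its measured value; the kernel does all the work, which is why low-dimensional but well-structured embeddings can match much larger hand-built representations.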

Machine-learning-guided directed evolution for protein engineering

The steps required to build machine-learning sequence–function models and to use those models to guide engineering are introduced and the underlying principles of this engineering paradigm are illustrated with the help of case studies.