• Corpus ID: 235390482

Adaptive machine learning for protein engineering

Brian L. Hie and Kevin Kaichuang Yang

Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine… 

Now What Sequence? Pre-trained Ensembles for Bayesian Optimization of Protein Sequences

It is shown how to use pre-trained sequence models in Bayesian optimization to design new protein sequences with minimal labels, and significantly fewer labeled sequences are required for three sequence design tasks, including creating novel peptide inhibitors with AlphaFold.

Data-Driven Optimization for Protein Design: Workflows, Algorithms and Metrics

  • Computer Science
  • 2022
This paper performs a systematic study of various design choices that arise in protein design, grounded in the problem of optimizing for protein stability, and uses these insights to propose workflows, protocols and metrics to assist practitioners in effectively applying data-driven approaches to protein design problems.

Conformal prediction for the design problem

This work introduces a method to quantify predictive uncertainty in such settings by constructing confidence sets for predictions that account for the dependence between the training and test data.

Efficient evolution of human antibodies from general protein language models and sequence information alone

It is reported that deep learning algorithms known as protein language models can evolve human antibodies with high efficiency, despite providing the models with no information about the target antigen, binding specificity, or protein structure, and also requiring no additional task-specific finetuning or supervision.

Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field

A road map to help solve problems in drug design, protein structures, design, and function prediction while also presenting the “state of the art” on research in the AI–PS binomial until February 2021 is presented.

Intelligent host engineering for metabolic flux optimisation in biotechnology

The relevant issues are rehearsed for those wishing to understand and exploit those modern genome-wide host engineering tools and thinking that have been designed and developed to optimise fluxes towards desirable products in biotechnological processes, with a focus on microbial systems.

A generative recommender system with GMM prior for cancer drug generation and sensitivity prediction

VADEERS offers a comprehensive model of drugs’ and cell lines’ properties and relationships between them, as well as a pre-computed clustering of the drugs by their inhibitory profiles.

Deep Extrapolation for Attribute-Enhanced Generation

Attribute extrapolation in sample generation is challenging for deep neural networks operating beyond the training distribution. We formulate a new task for extrapolation in sequence generation,

Synthetic Biology: Bottom-Up Assembly of Molecular Systems.

The bottom-up assembly of biological and chemical components opens exciting opportunities to engineer artificial vesicular systems for applications with previously unmet requirements. The modular



Machine-learning-guided directed evolution for protein engineering

The steps required to build machine-learning sequence–function models and to use those models to guide engineering are introduced and the underlying principles of this engineering paradigm are illustrated with the help of case studies.

Data-driven computational protein design.

Advances in machine learning for directed evolution.

Machine learning-assisted directed protein evolution with combinatorial libraries

It is proposed that the expense of experimentally testing a large number of protein variants can be decreased and the outcome can be improved by incorporating machine learning with directed evolution, and that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches.

Low-N protein engineering with data-efficient deep learning

A machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution is introduced.

ProGen: Language Modeling for Protein Generation

This work poses protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations and trains a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags.

Protein sequence design with deep generative models

Protein design and variant prediction using autoregressive generative models

This work introduces a deep generative model adapted from natural language processing for prediction and design of diverse functional sequences without the need for alignments, and successfully designs and tests a diverse 10^5-nanobody library that shows better expression than a 1000-fold larger synthetic library.

An evolution-based model for designing chorismate mutase enzymes

A process to learn the constraints for specifying proteins purely from evolutionary sequence data, design and build libraries of synthetic genes, and test them for activity in vivo using a quantitative complementation assay is described.

Model-based reinforcement learning for biological sequence design

A model-based variant of PPO, DyNA-PPO, is proposed to improve sample efficiency and performs significantly better than existing methods in settings in which modeling is feasible, while still not performing worse in situations in which a reliable model cannot be learned.