• Corpus ID: 235390482

Adaptive machine learning for protein engineering

@article{Hie2021AdaptiveML,
  title={Adaptive machine learning for protein engineering},
  author={Brian L. Hie and Kevin Kaichuang Yang},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.05466}
}
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine… 

Now What Sequence? Pre-trained Ensembles for Bayesian Optimization of Protein Sequences

This work shows how to use pretrained sequence models in Bayesian optimization to design new protein sequences with minimal labels, demonstrating that significantly fewer labeled sequences are required for many sequence design tasks, including creating novel peptide inhibitors with AlphaFold.

Data-Driven Optimization for Protein Design: Workflows, Algorithms and Metrics

This paper performs a systematic study of various design choices that arise in protein design, grounded in the problem of optimizing for protein stability, and uses these insights to propose workflows, protocols and metrics to assist practitioners in effectively applying data-driven approaches to protein design problems.

Conformal prediction for the design problem

This work introduces a method to quantify predictive uncertainty in such settings by constructing confidence sets for predictions that account for the dependence between the training and test data.

Efficient evolution of human antibodies from general protein language models and sequence information alone

It is reported that deep learning algorithms known as protein language models can evolve human antibodies with high efficiency, despite providing the models with no information about the target antigen, binding specificity, or protein structure, and also requiring no additional task-specific finetuning or supervision.

CLADE 2.0: Evolution-Driven Cluster Learning-Assisted Directed Evolution

An ensemble of multiple evolutionary scores is constructed to guide the initial sampling in CLADE, a new state-of-the-art tool for machine learning-assisted directed evolution that efficiently selects a training set within a small informative space using evolution-driven clustering sampling.

Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field

This review presents a road map to help solve problems in drug design, protein structures, design, and function prediction, together with the “state of the art” of research in the AI–PS binomial up to February 2021.

Intelligent host engineering for metabolic flux optimisation in biotechnology

The relevant issues are rehearsed for those wishing to understand and exploit the modern genome-wide host engineering tools and thinking that have been designed and developed to optimise fluxes towards desirable products in biotechnological processes, with a focus on microbial systems.

A generative recommender system with GMM prior for cancer drug generation and sensitivity prediction

VADEERS offers a comprehensive model of drugs’ and cell lines’ properties and relationships between them, as well as a pre-computed clustering of the drugs by their inhibitory profiles.

Deep Extrapolation for Attribute-Enhanced Generation

Attribute extrapolation in sample generation is challenging for deep neural networks operating beyond the training distribution. We formulate a new task for extrapolation in sequence generation,

References

Machine-learning-guided directed evolution for protein engineering

The steps required to build machine-learning sequence–function models and to use those models to guide engineering are introduced and the underlying principles of this engineering paradigm are illustrated with the help of case studies.

Data-driven computational protein design.

Advances in machine learning for directed evolution.

Machine learning-assisted directed protein evolution with combinatorial libraries

It is proposed that the expense of experimentally testing a large number of protein variants can be decreased and the outcome can be improved by incorporating machine learning with directed evolution, and that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches.

Low-N protein engineering with data-efficient deep learning

A machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution is introduced.

ProGen: Language Modeling for Protein Generation

This work poses protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations and trains a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags.

Protein sequence design with deep generative models

Navigating the protein fitness landscape with Gaussian processes

The ability of Gaussian processes to guide the search through protein sequence space by designing, constructing, and testing chimeric cytochrome P450s allowed us to engineer active P450 enzymes that are more thermostable than any previously made by chimeragenesis, rational design, or directed evolution.

Large-scale design and refinement of stable proteins using sequence-only models

A neural network model is reported that predicts protein stability based only on sequences of amino acids, and its performance is demonstrated by evaluating the stability of almost 200,000 novel proteins, providing a baseline for future work in the field.

Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design.
