Deep generative models of genetic variation capture mutation effects

@article{Riesselman2017DeepGM,
  title={Deep generative models of genetic variation capture mutation effects},
  author={Adam J. Riesselman and John Ingraham and Debora S. Marks},
  journal={bioRxiv},
  year={2017}
}
The functions of proteins and RNAs are determined by a myriad of interactions between their constituent residues, but most quantitative models of how molecular phenotype depends on genotype must approximate this by simple additive effects. While recent models have relaxed this constraint to also account for pairwise interactions, these approaches do not provide a tractable path towards modeling higher-order epistasis. Here, we show how latent variable models with nonlinear dependencies can be… 
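The abstract contrasts additive (site-independent) models of the genotype-phenotype map with pairwise and higher-order ones. As a minimal sketch of the simplest, additive case only (not the paper's nonlinear latent-variable model), the toy example below fits per-site amino-acid frequencies to a hypothetical alignment and scores a point mutation as a log-probability ratio; the alphabet and alignment are invented for brevity.

```python
# Toy site-independent ("additive") sequence model: each position is modeled
# independently, so no epistasis between residues can be captured.
import numpy as np

ALPHABET = "ACDE"  # hypothetical 4-letter alphabet for brevity
A = {c: i for i, c in enumerate(ALPHABET)}

msa = ["ACDA", "ACDA", "ACEA", "ADDA", "ACDA"]  # hypothetical toy alignment
L, q = len(msa[0]), len(ALPHABET)

# Per-site amino-acid frequencies with a +1 pseudocount to avoid log(0).
counts = np.ones((L, q))
for seq in msa:
    for i, c in enumerate(seq):
        counts[i, A[c]] += 1
freqs = counts / counts.sum(axis=1, keepdims=True)

def additive_log_prob(seq):
    """Log-probability under the independent-sites model: a sum of per-site terms."""
    return sum(np.log(freqs[i, A[c]]) for i, c in enumerate(seq))

def mutation_effect(wildtype, mutant):
    """Log-ratio score log p(mutant) - log p(wild type); negative = disfavoured."""
    return additive_log_prob(mutant) - additive_log_prob(wildtype)

score = mutation_effect("ACDA", "AEDA")  # substitute C -> E at position 2
```

Under this model a mutation's score depends only on the mutated position, regardless of the rest of the sequence; capturing context dependence (epistasis) requires the pairwise or latent-variable models the abstract goes on to discuss.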
Variational auto-encoding of protein sequences
TLDR
An embedding of natural protein sequences using a Variational Auto-Encoder is presented and used to predict how mutations affect protein function and to computationally guide exploration of protein sequence space and to better inform rational and automatic protein design.
Uninterpretable interactions: epistasis as uncertainty
TLDR
It is concluded that epistasis should be treated as a random, but quantifiable, variation in genotype-phenotype maps and that mechanistic, nonlinear models need to account for epistasis and decompose genotypes.
Learning protein constitutive motifs from sequence data
TLDR
It is shown that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information and be used to unveil and exploit the genotype–phenotype relationship for protein families.
Learning Compositional Representations of Interacting Systems with Restricted Boltzmann Machines: Comparative Study of Lattice Proteins
TLDR
It is shown that RBMs, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA.
Deep Learning of Protein Structural Classes: Any Evidence for an ‘Urfold’?
TLDR
This work describes the training of DL models on protein domain structures (and their associated physicochemical properties) in order to evaluate classification properties at CATH’s “homologous superfamily” (SF) level, utilizing a convolutional autoencoder model architecture.
Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model
TLDR
The result shows that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the MLM baseline under the same setting, revealing the potential of the sequence pre-training method to surpass MSA based methods in general.
Enhancing coevolution-based contact prediction by imposing structural self-consistency of the contacts
TLDR
CE-YAPP consistently improves contact prediction from multiple sequence alignments, in particular for proteins that are difficult targets, and its predicted contacts are shown to be in better agreement with those determined using traditional methods in structural biology.
Statistics, machine learning and deep learning for population genetic inference
TLDR
This dissertation addresses several foundational questions in statistics-based and machine learning-based inference, contributing several state-of-the-art statistical tools for population genetic inference.
RITA: a Study on Scaling Up Generative Protein Sequence Models
TLDR
This work conducts the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain, and evaluates RITA models on next amino acid prediction, zero-shot fitness prediction, and enzyme function prediction, showing benefits from increased scale.
Machine-learning-guided directed evolution for protein engineering
TLDR
The steps required to build machine-learning sequence–function models and to use those models to guide engineering are introduced and the underlying principles of this engineering paradigm are illustrated with the help of case studies.
...
...

References

SHOWING 1-10 OF 84 REFERENCES
Quantification of the effect of mutations using a global probability model of natural sequence variation
TLDR
This work presents a statistical approach for quantifying the contribution of residues and their interactions to protein function, using a statistical energy, the evolutionary Hamiltonian, and finds that these probability models predict the experimental effects of mutations with reasonable accuracy for a number of proteins.
Mutation effects predicted from sequence co-variation
TLDR
This work presents EVmutation, an unsupervised statistical method for predicting the effects of mutations that explicitly captures residue dependencies between positions and shows that it outperforms methods that do not account for epistasis.
Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data
TLDR
It is shown that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases and applies it to an atlas of human enhancers to show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
Variational auto-encoding of protein sequences
TLDR
An embedding of natural protein sequences using a Variational Auto-Encoder is presented and used to predict how mutations affect protein function and to computationally guide exploration of protein sequence space and to better inform rational and automatic protein design.
The spatial architecture of protein function and adaptation
TLDR
A high-throughput quantitative method is developed for a comprehensive single-mutation study in which every position is substituted individually to every other amino acid and shows that sector positions are functionally sensitive to mutation, whereas non-sector positions are more tolerant to substitution.
Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1
TLDR
A novel inference scheme for mutational landscapes, based on the statistical analysis of large alignments of homologs of the protein of interest, is developed; it captures epistatic couplings between residues and can therefore assess the dependence of mutational effects on the sequence context in which they appear.
A general framework for estimating the relative pathogenicity of human genetic variants
TLDR
The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.
Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes
TLDR
Comprehensive single-substitution mutational scanning of APH(3′)II, a Tn5 transposon-derived kinase that confers resistance to aminoglycoside antibiotics, was performed in Escherichia coli under selection with each of six structurally diverse antibiotics at a range of inhibitory concentrations; the resulting local fitness landscapes showed significant dependence on both antibiotic structure and concentration.
Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models.
TLDR
The pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques.
Systematic Mapping of Protein Mutational Space by Prolonged Drift Reveals the Deleterious Effects of Seemingly Neutral Mutations
TLDR
An improved method for measuring the effects of protein mutations that more closely replicates the natural evolutionary forces, and thereby a more realistic view of the mutational space of proteins is provided.
...
...