Inverse statistical physics of protein sequences: a key issues review.
@article{Cocco2018InverseSP, title={Inverse statistical physics of protein sequences: a key issues review.}, author={Simona Cocco and Christoph Feinauer and Matteo Figliuzzi and R{\'e}mi Monasson and Martin Weigt}, journal={Reports on progress in physics. Physical Society}, year={2018}, volume={81}, number={3}, pages={032601} }
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference…
139 Citations
Selection of sequence motifs and generative Hopfield-Potts models for protein families.
- Biology
- Physical review. E
- 2019
It is shown that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models, and an approach to parameter reduction is proposed, which is based on selecting collective sequence motifs.
Selection of sequence motifs and generative Hopfield-Potts models for protein families
- Biology
- bioRxiv
- 2019
It is shown that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models, and an approach to parameter reduction is proposed, which is based on selecting collective sequence motifs.
An Extended Ensemble Approach for Protein Sequence Variation
- Biology
- 2022
A generative model for protein sequences based on extended ensembles in statistical physics is explored, inferring a latent space with a tunable dimension allowing for de-novo sequences while preserving higher order statistics of the protein family.
Combined approaches from physics, statistics, and computer science for protein structure prediction: ab initio ex unitate vires (unity is strength)?
- Biology
- 2019
Approaches ranging from the development of specific computer architecture that allows longer timescales in physics-based simulations of protein folding to the recent advances in predicting contacts in proteins based on detection of coevolution using very large data sets of aligned protein sequences are reviewed.
Combined approaches from physics, statistics, and computer science for ab initio protein structure prediction: ex unitate vires (unity is strength)?
- Biology
- F1000Research
- 2018
Approaches ranging from the development of specific computer architecture that allows longer timescales in physics-based simulations of protein folding to the recent advances in predicting contacts in proteins based on detection of coevolution using very large data sets of aligned protein sequences are reviewed.
Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences.
- Computer Science
- Physical review. E
- 2020
DCA is shown to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and an iterative pairing algorithm allows predictions to be made even without a training set, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
Statistical physics of interacting proteins: impact of dataset size and quality assessed in synthetic sequences
- Computer Science
- bioRxiv
- 2020
DCA is shown to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and an iterative pairing algorithm (IPA) allows predictions to be made even without a training set, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
Aligning biological sequences by exploiting residue conservation and coevolution.
- Biology
- Physical review. E
- 2020
DCAlign is presented, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and to be therefore universally applicable to protein- and RNA-sequence alignment without the need of using complementary structural information.
Undersampling and the inference of coevolution in proteins
- Biology
- bioRxiv
- 2021
This work shows that undersampling issues explain the ability of current approaches to predict tertiary contacts between amino acids and their inability to clearly expose larger networks of functionally relevant, collectively evolving residues called sectors, establishing a necessary foundation for more deeply understanding and improving evolution-based models of proteins.
Learning protein constitutive motifs from sequence data
- Biology, Computer Science
- eLife
- 2019
It is shown that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information and be used to unveil and exploit the genotype–phenotype relationship for protein families.
References
SHOWING 1-10 OF 193 REFERENCES
From residue coevolution to protein conformational ensembles and functional dynamics
- Biology
- Proceedings of the National Academy of Sciences
- 2015
This paper adapts the Boltzmann-learning algorithm to the analysis of homologous protein sequences and develops a coarse-grained protein model specifically tailored to convert the resulting contact predictions to a protein structural ensemble, and analyzes the set of conformations consistent with the observed residue correlations.
Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners
- Computer Science, Biology
- PloS one
- 2014
The quality of inference is comparable or superior to that achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis for the prediction of residue-residue contacts in proteins and the identification of protein-protein interaction partners in bacterial signal transduction.
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
- Biology, Computer Science
- 1998
This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally of probabilistic methods of sequence analysis.
Protein 3D Structure Computed from Evolutionary Sequence Variation
- Biology
- PloS one
- 2011
Surprisingly, it is found that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures, and the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy.
Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection
- Biology
- Proceedings of the National Academy of Sciences
- 2014
It is shown that genomic data, physical coarse-grained free energy functions, and family-specific information theoretic models can be combined to give consistent estimates of energy landscape characteristics of natural proteins.
Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models.
- Biology
- Physical review. E, Statistical, nonlinear, and soft matter physics
- 2013
The pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques.
Evolutionary information for specifying a protein fold
- Biology
- Nature
- 2005
This work attempts to define the sequence rules for specifying a protein fold by computationally creating artificial protein sequences using only statistical information encoded in a multiple sequence alignment and no tertiary structure information.
On the Entropy of Protein Families
- Computer Science, Biology
- 2015
The entropy of protein families is estimated based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments.
Three-Dimensional Structures of Membrane Proteins from Genomic Sequencing
- Biology, Computer Science
- Cell
- 2012
Direct coevolutionary couplings reflect biophysical residue interactions in proteins.
- Biology
- The Journal of chemical physics
- 2016
A detailed spectral analysis of the coupling matrices resulting from 70 protein families is performed, to show that they contain quantitative information about the physico-chemical properties of amino-acid interactions.