Inverse statistical physics of protein sequences: a key issues review.

@article{Cocco2018InverseSP,
  title={Inverse statistical physics of protein sequences: a key issues review.},
  author={Simona Cocco and Christoph Feinauer and Matteo Figliuzzi and R{\'e}mi Monasson and Martin Weigt},
  journal={Reports on Progress in Physics},
  year={2018},
  volume={81},
  number={3},
  pages={032601}
}
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference… 

Selection of sequence motifs and generative Hopfield-Potts models for protein families.
TLDR
It is shown that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models, and an approach to parameter reduction is proposed, which is based on selecting collective sequence motifs.
An Extended Ensemble Approach for Protein Sequence Variation
TLDR
A generative model for protein sequences based on extended ensembles in statistical physics is explored, inferring a latent space with a tunable dimension that allows for de novo sequences while preserving higher-order statistics of the protein family.
Combined approaches from physics, statistics, and computer science for ab initio protein structure prediction: ex unitate vires (unity is strength)?
TLDR
Developments are reviewed that range from specialized computer architectures allowing longer timescales in physics-based simulations of protein folding to recent advances in predicting contacts in proteins by detecting coevolution in very large data sets of aligned protein sequences.
Statistical physics of interacting proteins: impact of dataset size and quality assessed in synthetic sequences
TLDR
DCA addresses both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and an iterative pairing algorithm (IPA) allows predictions to be made even without a training set, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
Aligning biological sequences by exploiting residue conservation and coevolution.
TLDR
DCAlign is presented, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and to be therefore universally applicable to protein- and RNA-sequence alignment without the need of using complementary structural information.
Undersampling and the inference of coevolution in proteins
TLDR
This work shows that undersampling issues explain both the ability of current approaches to predict tertiary contacts between amino acids and their inability to clearly expose larger networks of functionally relevant, collectively evolving residues called sectors, a necessary foundation for more deeply understanding and improving evolution-based models of proteins.
Learning protein constitutive motifs from sequence data
TLDR
It is shown that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information and be used to unveil and exploit the genotype–phenotype relationship for protein families.

References

SHOWING 1-10 OF 193 REFERENCES
From residue coevolution to protein conformational ensembles and functional dynamics
TLDR
This paper adapts the Boltzmann-learning algorithm to the analysis of homologous protein sequences and develops a coarse-grained protein model specifically tailored to convert the resulting contact predictions to a protein structural ensemble, and analyzes the set of conformations consistent with the observed residue correlations.
Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners
TLDR
The quality of inference is comparable or superior to that achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis for the prediction of residue-residue contacts in proteins and the identification of protein-protein interaction partners in bacterial signal transduction.
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
TLDR
This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally of probabilistic methods of sequence analysis.
Protein 3D Structure Computed from Evolutionary Sequence Variation
TLDR
Surprisingly, it is found that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures, and the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy.
Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection
TLDR
It is shown that genomic data, physical coarse-grained free energy functions, and family-specific information theoretic models can be combined to give consistent estimates of energy landscape characteristics of natural proteins.
Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models.
TLDR
The pseudolikelihood method, applied to 21-state Potts models describing the statistical properties of families of evolutionarily related proteins, significantly outperforms existing approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques.
Evolutionary information for specifying a protein fold
TLDR
This work attempts to define the sequence rules for specifying a protein fold by computationally creating artificial protein sequences using only statistical information encoded in a multiple sequence alignment and no tertiary structure information.
On the Entropy of Protein Families
TLDR
The entropy of protein families is estimated based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1- and 2-point) statistics of multi-sequence alignments.
Direct coevolutionary couplings reflect biophysical residue interactions in proteins.
TLDR
A detailed spectral analysis of the coupling matrices resulting from 70 protein families is performed, to show that they contain quantitative information about the physico-chemical properties of amino-acid interactions.