Learning protein constitutive motifs from sequence data

@article{Tubiana2019LearningPC,
  title={Learning protein constitutive motifs from sequence data},
  author={J{\'e}r{\^o}me Tubiana and Simona Cocco and R{\'e}mi Monasson},
  journal={eLife},
  year={2019},
  volume={8}
}
Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic… 
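The abstract's claim has a compact computational core: an RBM couples a visible layer holding the one-hot encoded columns of a multiple sequence alignment to a hidden layer whose units respond to recurring sequence motifs, and the model is trained so that sampling from it reproduces the statistics of the family. As a rough, assumption-laden illustration of that setup (not the paper's implementation, which uses dReLU hidden units and sparsity regularization rather than the plain Bernoulli hidden units and CD-1 training shown here), the toy numpy sketch below trains such a model on one-hot encoded sequences; all class names, sizes, and hyperparameters are invented for the example.

```python
# Toy RBM over aligned protein sequences: categorical (one-hot) visible units,
# Bernoulli hidden units, one step of contrastive divergence (CD-1).
# Illustrative sketch only; not the dReLU-hidden, regularized model of the paper.
import numpy as np

rng = np.random.default_rng(0)

def one_hot(msa, q=21):
    """msa: (N, L) integer array of amino-acid indices 0..q-1 -> (N, L, q) one-hot."""
    return np.eye(q)[msa]

class SequenceRBM:
    def __init__(self, L, q=21, n_hidden=50, lr=0.05):
        self.L, self.q, self.M, self.lr = L, q, n_hidden, lr
        self.W = 0.01 * rng.standard_normal((L, q, n_hidden))  # weights ("motifs")
        self.g = np.zeros((L, q))       # visible fields
        self.b = np.zeros(n_hidden)     # hidden biases

    def hidden_probs(self, v):
        # v: (N, L, q) one-hot; hidden input I_mu = sum_{i,a} W[i,a,mu] v[i,a]
        inp = np.tensordot(v, self.W, axes=([1, 2], [0, 1])) + self.b
        return 1.0 / (1.0 + np.exp(-inp))

    def visible_probs(self, h):
        # h: (N, M); per-site softmax over the q amino acids
        logits = self.g + np.tensordot(h, self.W, axes=([1], [2]))  # (N, L, q)
        logits -= logits.max(axis=2, keepdims=True)
        p = np.exp(logits)
        return p / p.sum(axis=2, keepdims=True)

    def sample_visible(self, h):
        p = self.visible_probs(h)                      # (N, L, q)
        cum = p.cumsum(axis=2)
        u = rng.random(p.shape[:2] + (1,))
        idx = np.minimum((u > cum).sum(axis=2), self.q - 1)  # categorical draw per site
        return np.eye(self.q)[idx]

    def cd1_step(self, v0):
        """One contrastive-divergence update on a minibatch of one-hot sequences."""
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = self.sample_visible(h0)                   # "negative" sequences
        ph1 = self.hidden_probs(v1)
        # Gradient: <v h>_data - <v h>_model, averaged over the minibatch
        dW = (np.einsum('nia,nm->iam', v0, ph0) - np.einsum('nia,nm->iam', v1, ph1)) / len(v0)
        self.W += self.lr * dW
        self.g += self.lr * (v0.mean(0) - v1.mean(0))
        self.b += self.lr * (ph0.mean(0) - ph1.mean(0))

# Toy usage on random "sequences" of length 30 over 21 states
msa = rng.integers(0, 21, size=(200, 30))
data = one_hot(msa)
rbm = SequenceRBM(L=30)
for _ in range(100):
    rbm.cd1_step(data[rng.choice(len(data), 32, replace=False)])
```

In the paper, it is the weights attached to each hidden unit (the slices of W above) that are read off and interpreted as the constitutive motifs of the title.
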
Selection of sequence motifs and generative Hopfield-Potts models for protein families
TLDR
It is shown that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models, and an approach to parameter reduction is proposed, which is based on selecting collective sequence motifs. (An illustrative parameter-count sketch appears at the end of this list.)
Improving sequence-based modeling of protein families using secondary structure quality assessment
TLDR
Two scoring functions characterizing how likely the secondary structure of a protein sequence is to match a reference structure, called Dot Product and Pattern Matching, are introduced; they help reject non-functional sequences generated by graphical models learned from homologous sequence alignments.
Restricted Boltzmann Machines and Sequence Homology Search
TLDR
This thesis investigates whether the information RBMs learn, as illustrated by Tubiana et al. in the paper Learning Protein Constitutive Motifs From Sequence Data, can be used for remote homology search.
Multiple Profile Models Extract Features from Protein Sequence Data and Resolve Functional Diversity of Very Different Protein Families
TLDR
ProfileView is shown to outperform three functional classification approaches (CUPP, PANTHER, and a recently developed neural-network approach based on Restricted Boltzmann Machines) and to overcome the time-complexity limitations of the latter; it resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences for accurate experimental design and the discovery of novel biological functions.
An evolution-based model for designing chorismate mutase enzymes
TLDR
A process is described for learning the constraints that specify proteins purely from evolutionary sequence data, designing and building libraries of synthetic genes, and testing them for activity in vivo using a quantitative complementation assay.
Generative power of a protein language model trained on multiple sequence alignments
TLDR
This work proposes and tests an iterative method that directly uses the masked language modeling objective to generate sequences with MSA Transformer, and demonstrates that the resulting sequences generally score better than those generated by Potts models, and even than natural sequences, on homology, coevolution, and structure-based measures.
Navigating the amino acid sequence space between functional proteins using a deep learning framework
TLDR
The ability of deep learning frameworks to model biological complexity and bring new tools to explore amino acid sequence and functional spaces is confirmed.
...
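
The Hopfield-Potts entry above rests on a parameter-count argument: if the pairwise Potts couplings are written as a sum over a small number p of patterns (the "collective sequence motifs"), the coupling part of the model needs on the order of p·L·q parameters instead of roughly L²·q²/2. The sketch below renders that idea under assumed shapes and conventions; the original formulation additionally uses a 1/L normalization and both attractive and repulsive patterns, and none of the names here come from the papers themselves.

```python
# Hedged sketch of the low-rank ("Hopfield-Potts") coupling idea: pairwise
# couplings J_ij(a,b) are built from a small number p of sequence patterns xi,
# so the parameter count grows as p*L*q instead of ~L^2*q^2.
import numpy as np

def hopfield_potts_energy(seq, fields, patterns):
    """Energy of one aligned sequence under a low-rank Potts model.

    seq:      (L,) integer amino-acid indices in 0..q-1
    fields:   (L, q) local fields h_i(a)
    patterns: (p, L, q) patterns xi^k_i(a); couplings are taken as
              J_ij(a,b) = sum_k xi^k_i(a) xi^k_j(b) for i != j
    """
    L, q = fields.shape
    onehot = np.eye(q)[seq]                            # (L, q)
    field_term = -np.sum(fields[np.arange(L), seq])
    # Overlap of the sequence with each pattern: m_k = sum_i xi^k_i(seq_i)
    m = np.einsum('kia,ia->k', patterns, onehot)       # (p,)
    # Pairwise term: -1/2 * sum_{i!=j} J_ij = -1/2 * sum_k (m_k^2 - diagonal i=j part)
    diag = np.einsum('kia,ia->k', patterns**2, onehot)
    coupling_term = -0.5 * np.sum(m**2 - diag)
    return field_term + coupling_term

# Parameter counts for a domain of length L=60 over q=21 states, p=30 patterns
L, q, p = 60, 21, 30
full_pairwise = L * (L - 1) // 2 * q * q + L * q
low_rank = p * L * q + L * q
print(full_pairwise, low_rank)   # ~782,000 vs ~39,000 parameters
```

The printed numbers are just the TLDR's claim made concrete: with p in the 20-40 range, the pattern-based parameterization needs roughly an order of magnitude (here about twenty times) fewer parameters than the full pairwise Potts model, while remaining close to generative.
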

References

Showing 1-10 of 137 references
From residue coevolution to protein conformational ensembles and functional dynamics
TLDR
This paper adapts the Boltzmann-learning algorithm to the analysis of homologous protein sequences and develops a coarse-grained protein model specifically tailored to convert the resulting contact predictions to a protein structural ensemble, and analyzes the set of conformations consistent with the observed residue correlations.
Coevolutionary Landscape of Kinase Family Proteins: Sequence Probabilities and Functional Motifs.
Learning generative models for protein fold families
TLDR
A new approach to learning statistical models from multiple sequence alignments (MSA) of proteins, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA, which encodes both the position‐specific conservation statistics and the correlated mutation statistics between sequential and long‐range pairs of residues.
Protein interactions and ligand binding: From protein subfamilies to functional specificity
TLDR
The combined analysis of SDPs in interfaces and ligand-binding sites provides a more complete picture of the organization of protein families, constituting the necessary framework for a large scale analysis of the evolution of protein function.
Natural-like function in artificial WW domains
TLDR
Construction of artificial protein sequences directed only by statistical coupling analysis (SCA) showed that the information extracted by this analysis is sufficient to engineer the WW fold at atomic resolution, and these artificial WW sequences were demonstrated to function like their natural counterparts, showing class-specific recognition of proline-containing target peptides.
Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions.
TLDR
The effects of multiple sequence information and different types of conformational constraints on the overall performance of the method are investigated, as is the ability of a variety of recently developed scoring functions to recognize native-like conformations in the ensembles of simulated structures.
Inverse statistical physics of protein sequences: a key issues review.
TLDR
An overview of some biologically important questions is given, together with how statistical-mechanics-inspired modeling approaches can help answer them, and some open questions are discussed.
Correlated mutations in models of protein sequences: phylogenetic and structural effects
TLDR
This paper identifies two reasons why naive use of covariation analysis for protein sequences fails to reliably indicate sequence positions that are spatially proximate and presents a null-model approach to solve this problem.
Variational auto-encoding of protein sequences
TLDR
An embedding of natural protein sequences using a Variational Auto-Encoder is presented and used to predict how mutations affect protein function, to computationally guide exploration of protein sequence space, and to better inform rational and automatic protein design. (An illustrative sketch appears after this reference list.)
...
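
The "Variational auto-encoding of protein sequences" reference above follows a recipe that is easy to state: encode one-hot aligned sequences into a low-dimensional Gaussian latent space, decode back to per-site amino-acid probabilities, train on the ELBO, and compare approximate log-likelihoods of mutant and wild-type sequences to score mutations. The PyTorch sketch below is a minimal, assumption-laden rendering of that recipe, not the authors' code; layer widths, latent dimension, and optimizer settings are invented.

```python
# Minimal VAE over one-hot encoded aligned protein sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqVAE(nn.Module):
    def __init__(self, L, q=21, d_latent=16, d_hidden=128):
        super().__init__()
        self.L, self.q = L, q
        self.enc = nn.Linear(L * q, d_hidden)
        self.mu = nn.Linear(d_hidden, d_latent)
        self.logvar = nn.Linear(d_hidden, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, L * q))

    def forward(self, x_onehot):                       # x_onehot: (N, L, q)
        h = torch.relu(self.enc(x_onehot.flatten(1)))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        logits = self.dec(z).view(-1, self.L, self.q)
        return logits, mu, logvar

def neg_elbo(logits, x_idx, mu, logvar):
    """Negative ELBO: per-site categorical reconstruction loss + KL to N(0, I)."""
    recon = F.cross_entropy(logits.transpose(1, 2), x_idx, reduction='none').sum(1)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(1)
    return (recon + kl).mean()

# Toy usage: a random "MSA" of 500 sequences of length 40
x_idx = torch.randint(0, 21, (500, 40))
x_onehot = F.one_hot(x_idx, 21).float()
model = SeqVAE(L=40)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    logits, mu, logvar = model(x_onehot)
    loss = neg_elbo(logits, x_idx, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With a trained model, scoring a point mutation amounts to evaluating neg_elbo on the mutant and wild-type encodings and comparing the two values; a lower negative ELBO corresponds to a higher approximate likelihood under the learned family model.
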