An expectation maximization algorithm for training hidden substitution models.

@article{Holmes2002AnEM,
  title={An expectation maximization algorithm for training hidden substitution models.},
  author={Ian H. Holmes and Gerald M. Rubin},
  journal={Journal of molecular biology},
  year={2002},
  volume={317},
  number={5},
  pages={753--764}
}
  • I. Holmes, G. Rubin
  • Published 12 April 2002
  • Computer Science
  • Journal of molecular biology
We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. Measuring the accuracy of multiple alignment algorithms with reference to BAliBASE (a… 
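As a rough illustration of the maximization half of such an EM procedure, the sketch below uses the standard maximum-likelihood estimate for a continuous-time Markov chain: each off-diagonal rate is the expected number of substitutions from state i to state j divided by the expected time spent in state i. It assumes an E-step has already produced those expected counts and dwell times; the function name `m_step_rates` and the toy numbers are hypothetical and are not taken from the paper.

```python
def m_step_rates(expected_counts, expected_dwell):
    """Re-estimate a rate matrix from expected sufficient statistics.

    expected_counts[i][j]: expected number of i -> j substitutions (i != j).
    expected_dwell[i]: expected total time spent in state i.
    Both are assumed to come from a separate E-step (not shown here).
    """
    n = len(expected_dwell)
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Off-diagonal rate: substitutions per unit of time spent in i.
            if i != j and expected_dwell[i] > 0:
                rates[i][j] = expected_counts[i][j] / expected_dwell[i]
        # Diagonal entry makes each row of the rate matrix sum to zero.
        rates[i][i] = -sum(rates[i][j] for j in range(n) if j != i)
    return rates


# Toy two-state example with made-up expected statistics.
counts = [[0.0, 2.0], [1.0, 0.0]]
dwell = [4.0, 2.0]
R = m_step_rates(counts, dwell)
```

In a full EM loop this update would alternate with an E-step that recomputes the expected counts and dwell times under the current rate matrix, repeating until the likelihood stops improving.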
Using evolutionary Expectation Maximization to estimate indel rates
TLDR
An algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model owing to Thorne, Kishino and Felsenstein (the ‘TKF91’ model) is presented.
Using guide trees to construct multiple-sequence evolutionary HMMs
TLDR
This work presents general algorithms for constructing an Evolutionary HMM from any Pair HMM and for doing dynamic programming to any Multiple-sequence HMM.
An improved general amino acid replacement matrix.
TLDR
This work further refines the method by incorporating the variability of evolutionary rates across sites in the matrix estimation and by using a much larger and more diverse database than BRKALN, which was used to estimate WAG.
Phylogenetic mixture models for proteins
TLDR
This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution, and shows that highly significant likelihood gains are obtained with mixture models compared with the best available single replacement matrices.
Phylogenetic Motif Detection by Expectation-Maximization on Evolutionary Mixtures
TLDR
This work treats aligned DNA sequence as a mixture of evolutionary models, for motif and background, and provides an algorithm to estimate the parameters by Expectation-Maximization, which can take advantage of phylogenetic information to avoid false positives and discover motifs upstream of groups of characterized target genes.
Empirical profile mixture models for phylogenetic reconstruction
TLDR
An expectation-maximization algorithm for estimating amino acid profile mixtures from alignment databases is introduced and it is observed that a set of 20 profiles is enough to provide a better statistical fit than currently available empirical matrices (WAG, JTT), in particular on saturated data.
Pseudo-Likelihood Analysis of Codon Substitution Models with Neighbor-Dependent Rates
TLDR
The pseudo-likelihood estimates are shown to be very accurate, and from analyzing 348 human-mouse coding sequences it is concluded that the incorporation of a CpG effect improves the fit of the model considerably.
The Expectation Maximization (EM) algorithm and some of its applications in Molecular Biology
TLDR
This tutorial gives a unified presentation of the application of EM to molecular biology problems, first explaining a form of EM that can be used for clustering, and then showing through three examples how different problems can be formalized in the same way, as mixtures of problem-specific probabilistic models combined with the same EM algorithm.
A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny
TLDR
Protein families display site-specific evolutionary dynamics that are ignored by standard protein phylogenetic models, and a class frequency mixture model (cF) is implemented in a freely available program called QmmRAxML for phylogenetic estimation.
Accurate estimation of gene evolutionary rates using XRATE, with an application to transmembrane proteins.
TLDR
The first tests of XRATE as a precise quantitative instrument for estimating evolutionary rates are reported, implementing a codon model similar to that of Goldman and Yang (1994), "A codon-based model of nucleotide substitution for protein-coding DNA sequences".

References

SHOWING 1-10 OF 25 REFERENCES
Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families
TLDR
A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced, which can improve the quality of HMMs produced from small training sets.
Modeling residue usage in aligned protein sequences via maximum likelihood.
  • W. Bruno
  • Biology
    Molecular biology and evolution
  • 1996
TLDR
The ability of this method to discard misleading phylogenetic effects allows the biochemical propensities of different positions in a sequence to be more clearly observed and interpreted.
Evolutionary HMMs: a Bayesian approach to multiple alignment
TLDR
A multiple alignment algorithm for Bayesian inference in the links model proposed by Thorne et al. is developed, finding that the mean sum-of-pairs score for the BAliBASE alignments is only 13% lower for Handel than for CLUSTALW, despite the relative simplicity of the links model.
Amino acid substitution matrices from protein blocks.
  • S. Henikoff, J. Henikoff
  • Biology
    Proceedings of the National Academy of Sciences of the United States of America
  • 1992
TLDR
This work has derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins, leading to marked improvements in alignments and in searches using queries from each of the groups.
Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology
TLDR
This paper corrects the previously published formula for estimating expected amino acid probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
A structural EM algorithm for phylogenetic inference
TLDR
This paper describes a new algorithm that uses Structural-EM for learning maximum likelihood trees, and proves that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge.
Modeling evolution at the protein level using an adjustable amino acid fitness model.
TLDR
An adjustable fitness model for amino acid site substitutions is investigated; when optimized, it outperforms mtREV in likelihood analysis on protein-coding mitochondrial genes and shows correspondence to some biophysical characteristics of amino acids.
RNA secondary structure prediction using stochastic context-free grammars and evolutionary history
TLDR
A method is presented that incorporates evolutionary history into RNA secondary structure prediction, using stochastic context-free grammars to give a prior probability distribution of structures; it performs very well compared to current automated methods.
A comprehensive comparison of multiple sequence alignment programs
TLDR
This paper presents the first systematic study of the most commonly used alignment programs using BAliBASE benchmark alignments as test cases, and proposes appropriate alignment strategies, depending on the nature of a particular set of sequences.
Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies.
TLDR
A codon-level model of coding sequence evolution in which position-specific amino acid frequencies are free parameters is introduced, which produces linear distance estimates over a wide range of distances, while several alternative models underestimate long distances relative to short distances.