An expectation maximization algorithm for training hidden substitution models.
@article{Holmes2002AnEM,
  title={An expectation maximization algorithm for training hidden substitution models.},
  author={Ian H. Holmes and Gerald M. Rubin},
  journal={Journal of molecular biology},
  year={2002},
  volume={317},
  number={5},
  pages={753-64}
}

We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. Measuring the accuracy of multiple alignment algorithms with reference to BAliBASE (a…
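
As a rough illustration of what the maximization step can look like when training a substitution rate matrix by EM, the sketch below applies the standard complete-data estimate for a continuous-time Markov chain: each off-diagonal rate is the expected number of substitutions divided by the expected time spent in the source state. This is not taken from the paper's derivation. The function name `m_step`, its arguments, and the toy numbers are hypothetical, and the expectation step (computing these statistics from an alignment and a tree) is assumed to have been done elsewhere.

```python
def m_step(expected_counts, expected_dwell):
    """Return the rate matrix that maximizes the expected complete-data likelihood.

    expected_counts[i][j]: expected number of i -> j substitutions (i != j)
    expected_dwell[i]:     expected total time spent in state i
    Both are assumed to come from a separate E-step (not shown here).
    """
    n = len(expected_dwell)
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and expected_dwell[i] > 0:
                rates[i][j] = expected_counts[i][j] / expected_dwell[i]
        # Each row of a rate matrix sums to zero.
        rates[i][i] = -sum(rates[i][j] for j in range(n) if j != i)
    return rates


# Toy two-state example with made-up sufficient statistics.
counts = [[0.0, 3.0], [1.0, 0.0]]
dwell = [6.0, 4.0]
R = m_step(counts, dwell)
```

In a full EM loop this update would alternate with an E-step that recomputes the expected counts and dwell times under the current rate matrix.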
94 Citations
Using evolutionary Expectation Maximization to estimate indel rates
- Computer Science
- Bioinform.
- 2005
An algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model owing to Thorne, Kishino and Felsenstein (the ‘TKF91’ model) is presented.
Using guide trees to construct multiple-sequence evolutionary HMMs
- Computer Science, Biology
- ISMB
- 2003
This work presents general algorithms for constructing an Evolutionary HMM from any Pair HMM and for doing dynamic programming to any Multiple-sequence HMM.
An improved general amino acid replacement matrix.
- Biology
- Molecular biology and evolution
- 2008
This work further refines the method by incorporating the variability of evolutionary rates across sites in the matrix estimation and by using a much larger and more diverse database than BRKALN, which was used to estimate WAG.
Phylogenetic mixture models for proteins
- Computer Science, Biology
- Philosophical Transactions of the Royal Society B: Biological Sciences
- 2008
This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution, and shows that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices.
Phylogenetic Motif Detection by Expectation-Maximization on Evolutionary Mixtures
- Biology
- Pacific Symposium on Biocomputing
- 2004
This work treats aligned DNA sequence as a mixture of evolutionary models, for motif and background, and provides an algorithm to estimate the parameters by Expectation-Maximization, which can take advantage of phylogenetic information to avoid false positives and discover motifs upstream of groups of characterized target genes.
Empirical profile mixture models for phylogenetic reconstruction
- Computer Science, Biology
- Bioinform.
- 2008
An expectation-maximization algorithm for estimating amino acid profile mixtures from alignment databases is introduced and it is observed that a set of 20 profiles is enough to provide a better statistical fit than currently available empirical matrices (WAG, JTT), in particular on saturated data.
Pseudo-Likelihood Analysis of Codon Substitution Models with Neighbor-Dependent Rates
- Computer Science
- J. Comput. Biol.
- 2005
The pseudo-likelihood estimates are shown to be very accurate, and from analyzing 348 human-mouse coding sequences it is concluded that the incorporation of a CpG effect improves the fit of the model considerably.
The Expectation Maximization (EM) algorithm and some of its applications in Molecular Biology
- Computer Science
- 2004
This tutorial gives a unified presentation of the application of EM to molecular biology problems, by first explaining a form of EM which can be used for clustering, and then showing through three examples how different problems can be formalized the same way, with mixtures of probabilistic models specific to each problem, used in combination with the same EM algorithm.
A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny
- Biology
- BMC Evolutionary Biology
- 2008
Protein families display site-specific evolutionary dynamics that are ignored by standard protein phylogenetic models and a class frequency mixture model (cF) is implemented in a freely available program called QmmRAxML for phylogenetic estimation.
Accurate estimation of gene evolutionary rates using XRATE, with an application to transmembrane proteins.
- Biology
- Molecular biology and evolution
- 2009
The first tests of XRATE are reported as a precise quantitative instrument for estimating evolutionary rates, implementing a codon model similar to that of Goldman and Yang (1994) (A codon-based model of nucleotide substitution for protein-coding DNA sequences).
References
Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families
- Computer Science
- ISMB
- 1993
A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced, which can improve the quality of HMMs produced from small training sets.
Modeling residue usage in aligned protein sequences via maximum likelihood.
- Biology
- Molecular biology and evolution
- 1996
The ability of this method to discard misleading phylogenetic effects allows the biochemical propensities of different positions in a sequence to be more clearly observed and interpreted.
Evolutionary HMMs: a Bayesian approach to multiple alignment
- Computer Science, Biology
- Bioinform.
- 2001
A multiple alignment algorithm for Bayesian inference in the links model proposed by Thorne et al. is developed, finding that the mean sum-of-pairs score for the BAliBASE alignments is only 13% lower for Handel than for CLUSTALW, despite the relative simplicity of the links model.
Amino acid substitution matrices from protein blocks.
- Biology
- Proceedings of the National Academy of Sciences of the United States of America
- 1992
This work has derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins, leading to marked improvements in alignments and in searches using queries from each of the groups.
Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology
- Mathematics
- Comput. Appl. Biosci.
- 1996
This paper corrects the previously published formula for estimating expected amino acid probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
A structural EM algorithm for phylogenetic inference
- Computer Science
- RECOMB
- 2001
This paper describes a new algorithm that uses Structural-EM for learning maximum likelihood trees, and proves that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge.
Modeling evolution at the protein level using an adjustable amino acid fitness model.
- Biology
- Pacific Symposium on Biocomputing
- 2000
An adjustable fitness model for amino acid site substitutions is investigated and when optimized it outperforms mtREV in likelihood analysis on protein-coding mitochondrial genes and shows correspondence to some biophysical characteristics of amino acids.
RNA secondary structure prediction using stochastic context-free grammars and evolutionary history
- Computer Science
- Bioinform.
- 1999
A method which incorporates evolutionary history into RNA secondary structure prediction, based on stochastic context-free grammars to give a prior probability distribution of structures, which performs very well compared to current automated methods.
A comprehensive comparison of multiple sequence alignment programs
- Computer Science
- Nucleic Acids Res.
- 1999
This paper presents the first systematic study of the most commonly used alignment programs using BAliBASE benchmark alignments as test cases, and proposes appropriate alignment strategies, depending on the nature of a particular set of sequences.
Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies.
- Biology
- Molecular biology and evolution
- 1998
A codon-level model of coding sequence evolution in which position-specific amino acid frequencies are free parameters is introduced, which produces linear distance estimates over a wide range of distances, while several alternative models underestimate long distances relative to short distances.









