Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies.

Anne L. Halpern and William J. Bruno
Molecular Biology and Evolution, volume 15, issue 7
Estimation of evolutionary distances from coding sequences must take into account protein-level selection to avoid relative underestimation of longer evolutionary distances. Current modeling of selection via site-to-site rate heterogeneity generally neglects another aspect of selection, namely position-specific amino acid frequencies. These frequencies determine the maximum dissimilarity expected for highly diverged but functionally and structurally conserved sequences, and hence are crucial… 
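
The link between position-specific frequencies and maximum dissimilarity can be illustrated with a small calculation. The sketch below is not the authors' method. It assumes that sites evolve independently and that, at saturation, the residues at a site in two sequences are independent draws from that site's equilibrium frequencies, so the expected mismatch probability at a site is 1 minus the sum of squared frequencies. The function name and the toy frequency profiles are hypothetical.

```python
def saturation_dissimilarity(site_freqs):
    """Expected fraction of differing sites between two fully diverged sequences.

    site_freqs: one list of residue frequencies per site (each sums to 1).
    Assumption: at saturation, the two residues at a site are independent
    draws from that site's equilibrium frequencies.
    """
    mismatch = [1.0 - sum(p * p for p in freqs) for freqs in site_freqs]
    return sum(mismatch) / len(mismatch)


# Toy profile 1 (hypothetical): every site allows all 20 amino acids equally.
uniform_sites = [[1 / 20] * 20 for _ in range(4)]

# Toy profile 2 (hypothetical): one invariant site, two sites that allow
# only two residues each, and one unconstrained site.
constrained_sites = [
    [1.0],
    [0.5, 0.5],
    [0.5, 0.5],
    [1 / 20] * 20,
]

print(saturation_dissimilarity(uniform_sites))      # approximately 0.95
print(saturation_dissimilarity(constrained_sites))  # approximately 0.4875
```

Under these assumptions, the constrained profile saturates at roughly half the dissimilarity of the uniform one, so a distance estimator that assumes uniform frequencies would read a fully saturated pair as only moderately diverged.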

Calculating site-specific evolutionary rates at the amino-acid or codon level yields similar rate estimates

Codon-level and amino-acid-level analysis frameworks are directly comparable and yield very similar inferences; the relationship between Rate4Site and dN/dS is also elucidated.

Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles

A probabilistic model is proposed that accounts for the heterogeneity of amino acid fitness profiles across the coding positions of a gene; applied to a dozen real protein-coding gene alignments, it is found to produce biologically plausible inferences.

Modeling site-specific amino-acid preferences deepens phylogenetic estimates of viral sequence divergence

Models informed by experimentally measured site-specific amino-acid preferences are found to estimate longer deep branches on phylogenies of influenza virus hemagglutinin. This underscores the importance of modeling site-specific amino-acid preferences when estimating deep divergence times, but also shows the inherent limitations of approaches that fail to account for how these preferences shift over time.

Population Genetics Based Phylogenetics Under Stabilizing Selection for an Optimal Amino Acid Sequence: A Nested Modeling Approach

A new phylogenetic approach, SelAC (Selection on Amino acids and Codons), whose substitution rates are based on a nested model linking protein expression to population genetics, indicates that there is great potential for more accurate inference of phylogenetic trees and branch lengths from already existing data through the use of nested, mechanistic models.

Theory of measurement for site-specific evolutionary rates in amino-acid sequences

This work develops a theory of measurement for site-specific evolutionary rates by analytically solving the maximum-likelihood equations for rate inference performed on sequences evolved under a mutation–selection model, and uses misspecification as a deliberate strategy to obtain robust and meaningful parameter inference.

Site-Specific Amino Acid Preferences Are Mostly Conserved in Two Closely Related Protein Homologs

It is found that site-specific evolutionary models informed by the experiments greatly outperformed nonsite-specific alternatives in fitting phylogenies of nucleoproteins from human, swine, equine, and avian influenza.

An Improved Codon Modeling Approach for Accurate Estimation of the Mutation Bias

An improved codon modeling approach in which the fixation rate is no longer treated as a scalar but as a tensor unfolding along multiple directions, giving an accurate representation of how mutation and selection oppose each other at equilibrium.

A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny

Protein families display site-specific evolutionary dynamics that are ignored by standard protein phylogenetic models; a class frequency mixture model (cF) is implemented in a freely available program called QmmRAxML for phylogenetic estimation.

Site-specific amino-acid preferences are mostly conserved in two closely related protein homologs

The results show that site-specific amino-acid preferences are sufficiently conserved that measuring mutational effects in one protein provides information that can improve quantitative evolutionary modeling of nearby homologs.

Physicochemical amino acid properties better describe substitution rates in large populations

A parametric codon model is proposed that distinguishes between radical and conservative substitutions, making it possible to assess whether radical substitutions are preferentially removed by selection; the results imply an important connection between the life history of a phylogenetic group and the model that best describes its evolution.



A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome.

Simulations help confirm previous suggestions that silent sites are saturated, leaving no evidence of heterogeneity in synonymous substitution rates, and confirm previous findings that substitution rates in the chloroplast genome are subject to both lineage-specific and locus-specific effects.

A codon-based model of nucleotide substitution for protein-coding DNA sequences.

Analyses of two data sets suggest that the new codon-based model can provide a better fit to data than can nucleotide-based models and can produce more reliable estimates of certain biologically important measures such as the transition/transversion rate ratio and the synonymous/nonsynonymous substitution rate ratio.

Codon substitution in evolution and the "saturation" of synonymous changes.

A mathematical model for codon substitution is presented, taking into account unequal mutation rates among different nucleotides and purifying selection. It is shown that, when the mutation rates are not equal, the estimate of synonymous substitutions obtained by Perler et al. increases nonlinearly, although the true number of synonymous substitutions increases linearly.

Estimation of Reversible Substitution Matrices from Multiple Pairs of Sequences

A weighting method is presented for pairs of taxa related by a known tree; it results in uniform weights for all branches and resembles one obtained using maximum likelihood. The resulting distance measure is shown to have better linearity than is obtained in a less general model.

Using substitution probabilities to improve position-specific scoring matrices

This work introduces a simple method for computing pseudo-counts that combines the diversity observed in each alignment position with amino acid substitution probabilities; the method was a substantial improvement over the traditional average score method used for constructing profiles.

A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data.

P. Lewis. Molecular Biology and Evolution, 1998.
The genetic algorithm described here required only 6% of the computational effort required by a conventional heuristic search using tree bisection/reconnection (TBR) branch swapping to obtain the same maximum-likelihood topology.

Combining protein evolution and secondary structure.

An evolutionary model that combines protein secondary structure and amino acid replacement is introduced. It allows likelihood analysis of aligned protein sequences and does not require the

Amino acid substitution matrices from protein blocks.

S. Henikoff, J. Henikoff. Proceedings of the National Academy of Sciences of the United States of America, 1992.
This work has derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins, leading to marked improvements in alignments and in searches using queries from each of the groups.

A Hidden Markov Model approach to variation among sites in rate of evolution.

The method of Hidden Markov Models is used to allow for unequal and unknown evolutionary rates at different sites in molecular sequences, and it is shown how to use the Newton-Raphson method to estimate branch lengths of a phylogeny and to infer from a phylogeny what assignment of rates to sites has the largest posterior probability.

Hidden Markov models in computational biology. Applications to protein modeling.

The results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling.