A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

@article{Eddy2008APM,
  title={A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation},
  author={Sean R. Eddy},
  journal={PLoS Computational Biology},
  year={2008},
  volume={4}
}
  • S. Eddy
  • Published 1 May 2008
  • Biology
  • PLoS Computational Biology
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I… 
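As an illustration of what it means for scores to follow a Gumbel distribution, here is a minimal Python sketch of the Gumbel tail probability P(S ≥ x) = 1 − exp(−e^(−λ(x − μ))). The function names, the location parameter `mu`, and the target count `n_targets` are assumptions introduced for this example and are not taken from the paper.

```python
import math

def gumbel_pvalue(score, mu, lam):
    """Tail probability P(S >= score) under a Gumbel distribution.

    mu  : location parameter (hypothetical name for this sketch)
    lam : the lambda parameter of the distribution
    """
    exponent = -lam * (score - mu)
    if exponent > 700.0:
        # exp() would overflow; the tail probability is 1 to machine precision.
        return 1.0
    # 1 - exp(-y) computed with expm1 to keep precision when y is tiny.
    return -math.expm1(-math.exp(exponent))

def gumbel_evalue(score, mu, lam, n_targets):
    """Expected number of scores >= score among n_targets comparisons."""
    return n_targets * gumbel_pvalue(score, mu, lam)
```

For scores well above `mu`, the P-value is approximately exp(−λ(x − μ)), which is why an error in λ changes the estimated significance exponentially in the score.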

Significance of Gapped Sequence Alignments
  • L. Newberg
  • Biology, Computer Science
    J. Comput. Biol.
  • 2008
TLDR
This work draws random samples directly from a well chosen, importance-sampling probability distribution to approximate alignment score significance, and shows that the extreme value significance statistic for the local alignment model that is examined does not follow a Gumbel distribution.
Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling
TLDR
An efficient and general method to compute the score distribution to any desired accuracy, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles, and extended to a model of transmembrane proteins.
A new generation of homology search tools based on probabilistic inference.
  • S. Eddy
  • Computer Science
    Genome informatics. International Conference on Genome Informatics
  • 2009
TLDR
The aim in HMMER3 is to achieve BLAST's speed while further improving the power of probabilistic inference based methods, with the goal of ushering in a new generation of more powerful homology search tools based on probabilistic inference methods.
How sequence alignment scores correspond to probability models
  • M. Frith
  • Biology, Computer Science
    bioRxiv
  • 2019
TLDR
This study shows how multiple models correspond to one set of scores and clarifies the statistical basis of sequence alignment, which involves judging whether whole sequences are related versus finding related parts.
Estimating statistical significance of local protein profile-profile alignments
TLDR
It is shown that improvements in statistical accuracy and sensitivity and high-quality alignment rate result from statistically characterizing alignments by establishing the dependence of statistical parameters on various measures associated with both individual and pairwise profile characteristics.
Island method for estimating the statistical significance of profile-profile alignment scores
TLDR
The island statistics can be generalized to profile-profile alignments to provide an efficient method for the alignment score normalization and has a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.
Accelerated Profile HMM Searches
  • S. Eddy
  • Computer Science
    PLoS Comput. Biol.
  • 2011
TLDR
An acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm, which computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment.
Parameterizing sequence alignment with an explicit evolutionary model
TLDR
This work identifies and implements several probabilistic evolutionary models compatible with the affine-cost insertion/deletion model used in standard pairwise sequence alignment, including one evolutionary model compatible with symmetric pair HMMs that are the basis for Smith-Waterman pairwise alignment, and two evolutionary models compatible with standard profile-based alignment.
Remote homology search with hidden Potts models
TLDR
A hidden Potts model (HPM) is developed that merges a Potts emission process to a generative probability model of insertion and deletion so they can be applied to sequence alignment and remote homology search using a new model that is based on importance sampling.
Where Does the Alignment Score Distribution Shape Come from?
TLDR
A novel score probability distribution is obtained which is qualitatively very similar to that of Karlin-Altschul but performs better than all previous models.

References

Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models
TLDR
The sensitivity of the hybrid method in the detection of sequence homology is found to be comparable to that of the Smith-Waterman alignment and significantly better than the Viterbi version of the probabilistic alignment.
Statistical significance and extremal ensemble of gapped local hybrid alignment
A “semi-probabilistic” alignment algorithm which combines ideas from Smith-Waterman and probabilistic alignment is proposed and studied in detail. It is predicted that the score statistics of this
Calibrating E-values for hidden Markov models using reverse-sequence null models
TLDR
It is found that using a reverse-sequence null model effectively removes biases owing to sequence length and composition and reduces the number of false positives in a database search.
The estimation of statistical parameters for local alignment score distributions.
TLDR
This work describes a form of the recently described 'island' method in detail, and uses it to investigate the functional dependence of these parameters on finite-length edge effects.
Scoring hidden Markov models
TLDR
Among the null model choices, a simple looping null model that emits characters according to the geometric mean of the character probabilities in the columns modeled by the hidden Markov model (HMM) performs well or best across all four discrimination experiments.
Hybrid alignment: high-performance with universal statistics
TLDR
Preliminary results using the PfamA database suggest that the hybrid algorithm achieves similar performance as existing methods for position-specific scoring systems as well, and is established as a high performance alignment algorithm with well-characterized, universal statistics.
BALSA: Bayesian algorithm for local sequence alignment.
TLDR
A Bayesian algorithm for local sequence alignment (BALSA), that takes into account the uncertainty associated with all unknown variables by incorporating in its forward sums a series of scoring matrices, gap parameters and all possible alignments.
Accurate formula for P-values of gapped local sequence and profile alignments.
  • R. Mott
  • Computer Science
    Journal of molecular biology
  • 2000
TLDR
A simple general approximation for the distribution of gapped local alignment scores is presented, suitable for assessing significance of comparisons between two protein sequences or a sequence and a profile, and factors which affect the accuracy of alignment statistics are investigated.
Rapid Assessment of Extremal Statistics for Gapped Local Alignment
TLDR
By identifying a complete set of linked clusters, "islands," this work devises a method which accurately predicts the extremal score statistics by using only one to a few pairwise alignments, and relies crucially on the link between the statistics of island scores and extremal score statistics.
Rapid significance estimation in local sequence alignment with gaps
TLDR
A new algorithmic approach is presented which estimates the more important of the Gumbel parameters at least five times faster than the traditional methods, and brings significance estimation into the realm of interactive applications.