Empirical statistical estimates for sequence similarity searches.

@article{Pearson1998EmpiricalSE,
  title={Empirical statistical estimates for sequence similarity searches.},
  author={William R. Pearson},
  journal={Journal of molecular biology},
  year={1998},
  volume={276},
  number={1},
  pages={71--84}
}
  • W. Pearson
  • Published 13 February 1998
  • Biology
  • Journal of molecular biology
The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the extreme value distribution from the mean and variance of the local similarity scores of unrelated sequences after the scores have been corrected for the expected effect of library sequence length. This approach allows accurate estimates to be calculated for both FASTA and Smith-Waterman similarity… 
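
The procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the FASTA implementation: it assumes a plain least-squares regression of score on the logarithm of library sequence length as the length correction, and it assumes the z-scores follow an extreme value (Gumbel) distribution scaled to mean 0 and variance 1. All function names (`fit_length_regression`, `z_score`, `evd_p_value`, `e_value`) are hypothetical and chosen only for this example.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_length_regression(scores, lengths):
    """Least-squares fit of score = mu + rho * ln(length).

    Assumption for this sketch: scores of unrelated sequences grow
    linearly with ln(library sequence length).
    Returns (mu, rho, residual_variance).
    """
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    rho = sxy / sxx
    mu = mean_y - rho * mean_x
    # Unbiased residual variance (two fitted parameters).
    var = sum((y - (mu + rho * x)) ** 2 for x, y in zip(xs, scores)) / (count - 2)
    return mu, rho, var


def z_score(score, length, mu, rho, var):
    """Length-corrected score in units of standard deviations."""
    return (score - (mu + rho * math.log(length))) / math.sqrt(var)


def evd_p_value(z):
    """P(Z >= z) for a Gumbel distribution with mean 0 and variance 1."""
    exponent = -math.pi / math.sqrt(6.0) * z - EULER_GAMMA
    # 1 - exp(-x) written with expm1 for accuracy when x is tiny.
    return -math.expm1(-math.exp(exponent))


def e_value(z, library_size):
    """Expected number of unrelated library sequences scoring >= z."""
    return library_size * evd_p_value(z)
```

For example, after fitting `mu`, `rho`, and `var` on scores assumed to come from unrelated library sequences, `e_value(z_score(s, n, mu, rho, var), library_size)` gives the expectation value for a hit with raw score `s` against a library sequence of length `n`. A production implementation would also need to exclude related (high-scoring) sequences from the fit, which this sketch does not attempt.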

Estimation of P-values for global alignments of protein sequences
TLDR
The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation.
A Simple Derivation of the Distribution of Pairwise Local Protein Sequence Alignment Scores
  • O. Bastien
  • Biology
    Evolutionary bioinformatics online
  • 2008
TLDR
It is demonstrated here that the Karlin-Altschul model can be derived with no reference to the extreme events theory.
Statistical Significance in Biological Sequence Comparison
TLDR
The chapter reviews the role of statistical significance estimates in biological sequence comparison, focusing on local similarity searches, and it is shown that, with the exception of highly biased protein sequences and sequences with low-complexity regions, real, unrelated protein sequences behave very similarly to sequences generated randomly.
Flexible sequence similarity searching with the FASTA3 program package.
  • W. Pearson
  • Biology
    Methods in molecular biology
  • 2000
The FASTA3 and FASTA2 packages provide a flexible set of sequence-comparison programs that are particularly valuable because of their accurate statistical estimates and high-quality alignments.
Homology-based method for identification of protein repeats using statistical significance estimates.
TLDR
An iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence and was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families.
A structure-based method for protein sequence alignment
MOTIVATION: With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that
Making Sense of Score Statistics for Sequence Alignments
TLDR
This paper aims to highlight a few of the principles that should be kept in mind when evaluating the statistical significance of alignments between sequences, and shows that the alignment statistics can undergo an abrupt phase transition.
Performance evaluation of a new algorithm for the detection of remote homologs with sequence comparison
A detailed analysis of the performance of hybrid, a new sequence alignment algorithm developed by Yu and coworkers that combines Smith Waterman local dynamic programming with a local version of the
Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap
TLDR
An unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap, is developed, showing that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased.
Estimating statistical significance of local protein profile-profile alignments
TLDR
This study presents a methodology for estimating the statistical significance of profile-profile alignments and shows that improvements in statistical accuracy and sensitivity and alignment quality result from statistically characterizing alignments by establishing the dependence of statistical parameters on various measures associated with both individual and pairwise profiles.

References

SHOWING 1-10 OF 29 REFERENCES
Improved tools for biological sequence comparison.
  • W. Pearson, D. Lipman
  • Biology, Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 1988
TLDR
Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.
Rapid and accurate estimates of statistical significance for sequence data base searches.
  • M. Waterman, M. Vingron
  • Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 1994
TLDR
This work presents a practical method to approximate the probability that a local alignment score is a result of chance alone, and presents applications to data base searching and the analysis of pairwise and self-comparisons of proteins.
Comparison of methods for searching protein sequence databases
  • W. Pearson
  • Computer Science
    Protein science : a publication of the Protein Society
  • 1995
TLDR
Search sensitivity with either the Smith‐Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45–55, and optimized gap penalties instead of the conventional PAM250 matrix.
Rapid and sensitive sequence comparison with FASTP and FASTA.
Applications and statistics for multiple high-scoring segments in molecular sequences.
  • S. Karlin, S. Altschul
  • Biology
    Proceedings of the National Academy of Sciences of the United States of America
  • 1993
TLDR
The statistical distribution for the sum of the scores of multiple high-scoring segments is described and its application to the identification of possible transmembrane segments and the evaluation of sequence similarity is illustrated.
Effective protein sequence comparison.
Dynamic programming algorithms for biological sequence comparison.
Performance evaluation of amino acid substitution matrices
TLDR
Matrices derived directly from either sequence‐based or structure‐based alignments of distantly related proteins performed much better overall than extrapolated matrices based on the Dayhoff evolutionary model.
Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.
  • S. Karlin, S. Altschul
  • Biology
    Proceedings of the National Academy of Sciences of the United States of America
  • 1990
TLDR
Using an appropriate random model, this work presents a theory that provides precise numerical formulas for assessing the statistical significance of any region with high aggregate score and examples are given of applications to a variety of protein sequences, highlighting segments with unusual biological features.
The significance of protein sequence similarities
TLDR
A general method of assessing the significance of scored best local alignments, particularly suited to protein sequence comparisons, is described, and the expected frequency of occurrence of any score can be calculated, together with the number of standard deviations above expectation, to provide sensible measures of significance.