# Empirical statistical estimates for sequence similarity searches.

@article{Pearson1998EmpiricalSE,
  title={Empirical statistical estimates for sequence similarity searches.},
  author={William R. Pearson},
  journal={Journal of molecular biology},
  year={1998},
  volume={276},
  number={1},
  pages={71-84}
}

The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the extreme value distribution from the mean and variance of the local similarity scores of unrelated sequences after the scores have been corrected for the expected effect of library sequence length. This approach allows accurate estimates to be calculated for both FASTA and Smith-Waterman similarity…
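The procedure summarized in the abstract can be sketched as follows. This is a minimal illustration, not the FASTA implementation: it assumes the length correction is a least-squares regression of score on the logarithm of library sequence length, and that the corrected scores are converted to z-scores and evaluated against an extreme value distribution with mean 0 and variance 1. All function names here are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_length_regression(scores, lengths):
    """Fit score = a + b * ln(length) by least squares (assumed correction).

    Returns the intercept a, slope b, and residual variance.
    """
    xs = [math.log(n) for n in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, scores)]
    var = sum(r * r for r in residuals) / (n - 2)
    return a, b, var


def z_score(score, length, a, b, var):
    """Length-corrected score in standard deviations above expectation."""
    return (score - (a + b * math.log(length))) / math.sqrt(var)


def evd_pvalue(z):
    """P(Z >= z) for an extreme value distribution with mean 0, variance 1."""
    x = math.exp(-math.pi / math.sqrt(6.0) * z - EULER_GAMMA)
    return -math.expm1(-x)  # 1 - exp(-x), stable when x is tiny


def expectation(z, library_size):
    """Expected number of unrelated library sequences scoring at least z."""
    return library_size * evd_pvalue(z)
```

In practice the regression would be fitted only to scores from sequences believed to be unrelated to the query; how those sequences are selected is not covered by this sketch.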

## 352 Citations

Estimation of P-values for global alignments of protein sequences

- Biology
- Bioinform.
- 2001

The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation.

A Simple Derivation of the Distribution of Pairwise Local Protein Sequence Alignment Scores

- Biology
- Evolutionary bioinformatics online
- 2008

It is demonstrated here that the Karlin-Altschul model can be derived with no reference to the extreme events theory.

Statistical Significance in Biological Sequence Comparison

- Biology
- 2004

The chapter reviews the role of statistical significance estimates in biological sequence comparison, focusing on local similarity searches, and it is shown that, with the exception of highly biased protein sequences and sequences with low-complexity regions, real, unrelated protein sequences behave very similarly to sequences generated randomly.

Flexible sequence similarity searching with the FASTA3 program package.

- Biology
- Methods in molecular biology
- 2000

The FASTA3 and FASTA2 packages provide a flexible set of sequence-comparison programs that are particularly valuable because of their accurate statistical estimates and high-quality alignments.…

Homology-based method for identification of protein repeats using statistical significance estimates.

- Biology
- Journal of molecular biology
- 2000

An iterative algorithm based on optimal and sub-optimal score distributions from profile analysis estimates the significance of all repeats detected in a single sequence; it was used to merge several repeat families that were previously thought to be distinct, indicating common phylogenetic origins for these families.

A structure-based method for protein sequence alignment

- Biology
- Bioinform.
- 2005

MOTIVATION: With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that…

Making Sense of Score Statistics for Sequence Alignments

- Computer Science
- Briefings Bioinform.
- 2001

This paper aims to highlight a few of the principles that should be kept in mind when evaluating the statistical significance of alignments between sequences, and shows that the alignment statistics can undergo an abrupt phase transition.

Performance evaluation of a new algorithm for the detection of remote homologs with sequence comparison

- Computer Science
- Proteins
- 2002

A detailed analysis of the performance of hybrid, a new sequence alignment algorithm developed by Yu and coworkers that combines Smith-Waterman local dynamic programming with a local version of the…

Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap

- Computer Science
- Bioinform.
- 2005

An unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap, is developed, showing that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased.

Estimating statistical significance of local protein profile-profile alignments

- Biology
- bioRxiv
- 2018

This study presents a methodology for estimating the statistical significance of profile-profile alignments and shows that improvements in statistical accuracy and sensitivity and alignment quality result from statistically characterizing alignments by establishing the dependence of statistical parameters on various measures associated with both individual and pairwise profiles.

## References

Improved tools for biological sequence comparison.

- Biology, Computer Science
- Proceedings of the National Academy of Sciences of the United States of America
- 1988

Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.

Rapid and accurate estimates of statistical significance for sequence data base searches.

- Computer Science
- Proceedings of the National Academy of Sciences of the United States of America
- 1994

This work presents a practical method to approximate the probability that a local alignment score is a result of chance alone, and presents applications to data base searching and the analysis of pairwise and self-comparisons of proteins.

Comparison of methods for searching protein sequence databases

- Computer Science
- Protein science: a publication of the Protein Society
- 1995

Search sensitivity with either the Smith‐Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45–55, and optimized gap penalties instead of the conventional PAM250 matrix.

Rapid and sensitive sequence comparison with FASTP and FASTA.

- Biology
- Methods in enzymology
- 1990

Applications and statistics for multiple high-scoring segments in molecular sequences.

- Biology
- Proceedings of the National Academy of Sciences of the United States of America
- 1993

The statistical distribution for the sum of the scores of multiple high-scoring segments is described and its application to the identification of possible transmembrane segments and the evaluation of sequence similarity is illustrated.

Dynamic programming algorithms for biological sequence comparison.

- Computer Science
- Methods in enzymology
- 1992

Performance evaluation of amino acid substitution matrices

- Biology
- Proteins
- 1993

Matrices derived directly from either sequence‐based or structure‐based alignments of distantly related proteins performed much better overall than extrapolated matrices based on the Dayhoff evolutionary model.

Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.

- Biology
- Proceedings of the National Academy of Sciences of the United States of America
- 1990

Using an appropriate random model, this work presents a theory that provides precise numerical formulas for assessing the statistical significance of any region with a high aggregate score. Examples are given of applications to a variety of protein sequences, highlighting segments with unusual biological features.

The significance of protein sequence similarities

- Biology
- Comput. Appl. Biosci.
- 1988

A general method of assessing the significance of scored best local alignments, particularly suited to protein sequence comparisons, is described, and the expected frequency of occurrence of any score can be calculated, together with the number of standard deviations above expectation, to provide sensible measures of significance.