Comparison of methods for searching protein sequence databases

@article{Pearson1995ComparisonOM,
  title={Comparison of methods for searching protein sequence databases},
  author={William R. Pearson},
  journal={Protein Science},
  year={1995},
  volume={4}
}
  • W. Pearson
  • Published 1 June 1995
  • Computer Science
  • Protein Science
We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith‐Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45–55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the… 
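For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of how a Smith-Waterman local alignment score depends on a scoring function and gap penalties. It is not the paper's implementation. The function names, the toy match/mismatch scores (a stand-in for a real matrix such as BLOSUM or PAM250), and the default penalties are all assumptions made for illustration. A gap of length k is assumed to cost gap_open + (k - 1) * gap_extend.

```python
def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix: +5 match, -4 mismatch."""
    return 5 if x == y else -4


def smith_waterman_score(a, b, score=toy_score, gap_open=12, gap_extend=2):
    """Best local alignment score of a and b with affine gap penalties.

    Uses the Gotoh form of the Smith-Waterman recurrence and keeps only
    two rows of the dynamic programming table in memory.
    """
    neg_inf = float("-inf")
    m = len(b)
    prev_h = [0] * (m + 1)        # best score of an alignment ending at (i-1, j)
    prev_f = [neg_inf] * (m + 1)  # same, but ending in a gap that skips a[i-1]
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [neg_inf] * (m + 1)
        e = neg_inf               # ending in a gap that skips b[j-1]
        for j in range(1, m + 1):
            # Open a new gap or extend an existing one, in each direction.
            e = max(cur_h[j - 1] - gap_open, e - gap_extend)
            cur_f[j] = max(prev_h[j] - gap_open, prev_f[j] - gap_extend)
            diag = prev_h[j - 1] + score(a[i - 1], b[j - 1])
            # The floor of 0 is what makes the alignment local.
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

Calling smith_waterman_score("ACDEFGHIK", "ACDEGHIK") with a lower gap_open lets the alignment bridge the missing residue at a smaller cost, which is the kind of parameter sensitivity the abstract refers to when it compares matrices and gap penalties.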
Comparative accuracy of methods for protein sequence similarity search
TLDR
Probabilistic Smith-Waterman (PSW), which is based on hidden Markov models for a single sequence using a standard scoring matrix, and a new version of BLAST (WU-BLAST2), which uses Sum statistics for gapped alignments, are compared.
Increased Coverage Obtained by Combination of Methods for Protein Sequence Database Searching
TLDR
The union of results by BLAST (p-value) and FASTA at an equal p-value cutoff gave significantly better coverage than either method individually, and the best overall performance was obtained from the intersection of the results from SSEARCH and the GSRCH62 global alignment method.
Effective protein sequence comparison.
Empirical statistical estimates for sequence similarity searches.
  • W. Pearson
  • Biology
    Journal of molecular biology
  • 1998
The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the
Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements.
TLDR
The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST, and the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition.
Comparing algorithms for large-scale sequence analysis
TLDR
This paper ports both Smith-Waterman and BLAST to the Frontier platform, enabling the efficient use of these algorithms on large sequence databases, and presents a novel visualization tool along with quantitative metrics for comparing the results of alternative sequence alignment algorithms.
BALSA: Bayesian algorithm for local sequence alignment.
TLDR
A Bayesian algorithm for local sequence alignment (BALSA) that takes into account the uncertainty associated with all unknown variables by incorporating in its forward sums a series of scoring matrices, gap parameters, and all possible alignments.
Empirical determination of effective gap penalties for sequence comparison
TLDR
These gap penalties can improve expectation values by at least one order of magnitude when searching with short sequences, and improve the alignment of proteins containing short sequences repeated in tandem.
Flexible sequence similarity searching with the FASTA3 program package.
  • W. Pearson
  • Biology
    Methods in molecular biology
  • 2000
The FASTA3 and FASTA2 packages provide a flexible set of sequence-comparison programs that are particularly valuable because of their accurate statistical estimates and high-quality alignments.
Testing statistical significance scores of sequence comparison methods with structure similarity
TLDR
Two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score, and the compute-intensive Z-score does not have a clear advantage over the e-value.

References

SHOWING 1-10 OF 44 REFERENCES
Performance evaluation of amino acid substitution matrices
TLDR
Matrices derived directly from either sequence‐based or structure‐based alignments of distantly related proteins performed much better overall than extrapolated matrices based on the Dayhoff evolutionary model.
Improved tools for biological sequence comparison.
  • W. Pearson, D. Lipman
  • Biology, Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 1988
TLDR
Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.
Basic local alignment search tool.
A structural basis for sequence comparisons. An evaluation of scoring methodologies.
TLDR
A residue-exchange matrix has been derived that is suitable for comparison of amino acid sequences and the search for homologous sequences in amino acid databases, and it is found that the matrix derived here is among the better performers in terms of alignment significance, detection of homologous sequences, and the accuracy of alignments.
Profile analysis: detection of distantly related proteins.
TLDR
Tests with globin and immunoglobulin sequences show that profile analysis can distinguish all members of these families from all other sequences in a database containing 3800 protein sequences.
Sequence alignment and penalty choice. Review of concepts, case studies and implications.
A platform for biological sequence comparison on parallel computers
TLDR
The rapid FASTA sequence comparison algorithm and the more rigorous Smith-Waterman algorithm are implemented within this framework for similarity searching.