Testing statistical significance scores of sequence comparison methods with structure similarity

@article{Hulsen2006TestingSS,
  title={Testing statistical significance scores of sequence comparison methods with structure similarity},
  author={Tim Hulsen and Jacob de Vlieg and Jack A. M. Leunissen and Peter M. A. Groenen},
  journal={BMC Bioinformatics},
  year={2006},
  volume={7},
  pages={444}
}
Background: In the past years, the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been… 
TULIP software and web server : automatic classification of protein sequences based on pairwise comparisons and Z-value statistics
TLDR
A web server is developed that allows the local or online computation of TULIP trees based on the CSHP probabilities and a classification of protein sequences based on pairwise alignments and following evolutionary assumptions.
Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores
TLDR
A model of sequence evolution based on aging, as meant in Reliability Theory, is built from a sequence alignment score, using the fact that the amount of information shared between an initial sequence and the sequences in its lineage is a decreasing function of time.
Normalized global alignment for protein sequences.
Island method for estimating the statistical significance of profile-profile alignment scores
TLDR
The island statistics can be generalized to profile-profile alignments to provide an efficient method for alignment score normalization and have a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.
Algorithms in comparative genomics
TLDR
The author has studied and established a simple prescription for obtaining a better phylogeny: improve the underlying alignments used in phylogeny reconstruction by extending Gotoh's iterative heuristic to iterate with maximum parsimony guide-trees.
Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities
TLDR
It is demonstrated, for the first time, that partition function match probabilities used for expected accuracy alignment, as done in Probalign, provide statistically significant improvement over current approaches for identifying distantly related RNA sequences in larger genomic segments.
Enhanced Sequence-Based Function Prediction Methods and Application to Functional Similarity Networks
TLDR
The network structure of gene functional space, built by connecting proteins with functional similarity and similar to the structures of protein-protein interaction networks and metabolic pathway networks, is discussed.
Algorithms for the study of RNA and protein structure
TLDR
A system to automatically generate two-dimensional representations of protein structure that are particularly useful in analysing complex protein folds and a method for using these diagrams as an interface to the protein substructure search methods.
Graph-based methods for large-scale protein classification and orthology inference
TLDR
It is argued that establishing true orthologous relationships requires a phylogenetic approach which combines both trees and graphs (networks), reliable species phylogeny, genomic data for more than two species, and an insight into the processes of molecular evolution.
Ranking MEDLINE documents
TLDR
A new methodology is developed that enables the automation of the assessment process based on a multi-criteria ranking function that contemplates six factors and seems appropriate to retrieve relevant papers out of a huge repository such as MEDLINE.

References

Comparative accuracy of methods for protein sequence similarity search
TLDR
Probabilistic Smith-Waterman (PSW), which is based on hidden Markov models for a single sequence using a standard scoring matrix, and a new version of BLAST (WU-BLAST2), which uses Sum statistics for gapped alignments, are compared.
Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap
TLDR
An unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap, is developed, showing that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased.
Improved tools for biological sequence comparison.
  • W. Pearson, D. Lipman
  • Biology, Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 1988
TLDR
Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.
Comparison of methods for searching protein sequence databases
  • W. Pearson
  • Computer Science
    Protein science : a publication of the Protein Society
  • 1995
TLDR
Search sensitivity with either the Smith‐Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45–55, and optimized gap penalties instead of the conventional PAM250 matrix.
Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements.
TLDR
The use of composition-based statistics, that is, the use for each database sequence of a position-specific scoring system tuned to that sequence's amino acid composition, is particularly beneficial for large-scale automated applications of PSI-BLAST.
Assessing sequence comparison methods with the average precision criterion
TLDR
This work finds that the low-complexity segment filtration procedure in BLAST actually harms its overall search quality and that the AP scores of different search methods are approximately proportional to the logarithm of search time.
Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics
TLDR
This study provides the missing theoretical link between a Z-value cut-off used for an automatic clustering of putative orthologs and/or paralogs, and the corresponding statistical risk in such genome-scale comparisons (using non-biased or biased genomes).
Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.
TLDR
The extent to which the SAM-T98 implementation of a hidden Markov model procedure, PSI-BLAST, and the intermediate sequence search (ISS) procedure can detect evolutionary relationships between the members of the sequence database PDBD40-J is determined.