Alignment-Free Sequence Comparison (I): Statistics and Power
@article{Reinert2009AlignmentFreeSC,
  title={Alignment-Free Sequence Comparison (I): Statistics and Power},
  author={Gesine Reinert and David S. H. Chew and Fengzhu Sun and Michael S. Waterman},
  journal={Journal of computational biology : a journal of computational molecular cell biology},
  year={2009},
  volume={16},
  number={12},
  pages={1615-34}
}
Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D(2) word count…
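As a rough illustration of the k-tuple comparison the abstract describes, the sketch below computes a plain D2 score as the number of matching k-tuple pairs between two sequences, i.e. the sum over words of the product of their counts in each sequence. The function names `kmer_counts` and `d2` are hypothetical and chosen for this example; they are not taken from the paper or its software.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Plain D2: number of (position in A, position in B) pairs whose k-tuples match.

    Equivalent to summing count_A(w) * count_B(w) over all words w.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b, so those terms vanish.
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGTTTGA", 3))
```

Because the score is a raw product of counts, it grows with sequence length and with the background frequency of each word, which is one way to see why the abstract describes it as dominated by single-sequence noise.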
189 Citations
Alignment-Free Sequence Comparison (II): Theoretical Power of Comparison Statistics
- Biology | J. Comput. Biol.
- 2010
The power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, is studied, both from a theoretical viewpoint and numerically, providing an easy-to-use program.
The Distribution of Word Matches Between Markovian Sequences with Periodic Boundary Conditions
- Mathematics | J. Comput. Biol.
- 2014
This work derives exact formulas for the mean and variance of the D(2) statistic for Markovian sequences of any order, and demonstrates through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest.
Alignment-Free Sequence Comparison Based on Next-Generation Sequencing Reads
- Biology | J. Comput. Biol.
- 2013
The statistic d(s)(2) provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
The Power of Alignment-Free Histogram-based Functions: a Comprehensive Genome Scale Experimental Analysis - Version 1
- Computer Science | ArXiv
- 2021
By concentrating on histogram-based AF functions, this work performs the first coherent and uniform evaluation of the power of those functions, involving also Type I error for completeness, and provides a characterization of those AF functions that is novel and informative.
The power of word-frequency-based alignment-free functions: a comprehensive large-scale experimental analysis
- Computer Science | Bioinform.
- 2022
By concentrating on a representative set of word-frequency based AF functions, the first coherent and uniform evaluation of the power of those AF functions is performed, involving also Type I error for completeness.
Weighted k-word matches: a sequence comparison tool for proteins
- Biology
- 2011
A new statistic is defined, the weighted word match, which reflects the varying degrees of similarity between pairs of amino acids, and the distribution function for various forms of this statistic for sequences of identically and independently distributed letters is simulated.
New powerful statistics for alignment-free sequence comparison under a pattern transfer model.
- Mathematics | Journal of theoretical biology
- 2011
Extraction of high quality k-words for alignment-free sequence comparison.
- Computer Science | Journal of theoretical biology
- 2014
Sequence Comparison without Alignment: The SpaM approaches
- Computer Science, Biology | bioRxiv
- 2019
A number of alignment-free methods that are based on spaced word matches (‘SpaM’), i.e. on inexact word matches, that are able to contain mismatches at certain pre-defined positions based on stochastic models of molecular evolution are described.
A survey and evaluations of histogram-based statistics in alignment-free sequence comparison
- Computer Science | Briefings Bioinform.
- 2019
It is found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment, and paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences.
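Several entries above mention D2S, described as a self-standardized version of D2 built on centralized counts. The following is a minimal sketch of one common form of that statistic, under assumptions that are not stated on this page: letters are i.i.d. with known probabilities `letter_probs`, each word count is centered by its expected value, and each word contributes x*y / sqrt(x^2 + y^2). The helper names and the i.i.d. background model are illustrative choices, not the authors' program.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def centered_counts(seq, k, letter_probs):
    """Return {word: count - expected count} for every k-letter word.

    Assumption: i.i.d. letters, so P(word) is the product of letter probabilities.
    """
    n_words = max(len(seq) - k + 1, 0)
    counts = Counter(seq[i:i + k] for i in range(n_words))
    centered = {}
    for letters in product(sorted(letter_probs), repeat=k):
        word = "".join(letters)
        p_word = prod(letter_probs[c] for c in letters)
        centered[word] = counts[word] - n_words * p_word
    return centered


def d2s(seq_a, seq_b, k, letter_probs):
    """Self-standardized score: sum over words of x*y / sqrt(x^2 + y^2)."""
    xa = centered_counts(seq_a, k, letter_probs)
    xb = centered_counts(seq_b, k, letter_probs)
    total = 0.0
    for word, x in xa.items():
        y = xb[word]
        denom = sqrt(x * x + y * y)
        if denom > 0:  # skip words where both centered counts are exactly zero
            total += x * y / denom
    return total


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(d2s("ACGTACGTAA", "ACGTTTGACA", 2, uniform))
```

The loop runs over all words of the alphabet rather than only the observed ones, because a word that never occurs still has a nonzero centered count. This costs time exponential in k, so the sketch is only practical for small word sizes.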
References
Alignment-Free Sequence Comparison (II): Theoretical Power of Comparison Statistics
- Biology | J. Comput. Biol.
- 2010
The power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, is studied, both from a theoretical viewpoint and numerically, providing an easy-to-use program.
A statistical method for alignment-free comparison of regulatory sequences
- Biology, Computer Science | ISMB/ECCB
- 2007
The use of a new score for alignment-free sequence comparison, called the score, based on comparing the frequencies of all fixed-length words in the two sequences, which is highly successful in discriminating functionally related regulatory sequences from unrelated sequence pairs.
Asymptotic Behavior of k-Word Matches Between two Uniformly Distributed Sequences
- Mathematics | Journal of Applied Probability
- 2007
Given two sequences of length n over a finite alphabet A of size |A| = d, the D2 statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics…
Distributional regimes for the number of k-word matches between two random sequences
- Mathematics | Proceedings of the National Academy of Sciences of the United States of America
- 2002
Using an independence model of DNA sequences, the limiting distributions of D2 are derived by means of the Stein and Chen–Stein methods and three asymptotic regimes are identified, including compound Poisson and normal.
Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences
- Biology | BMC Bioinformatics
- 2006
The distribution of the D2 statistic at optimal word sizes is characterized and it is found that the best trade-off between computational efficiency and accuracy is obtained with exact word matches.
Error bounds on multivariate Normal approximations for word count statistics
- Mathematics, Computer Science | Advances in Applied Probability
- 2002
An explicit bound on the error made when approximating the multivariate distribution of U by the normal distribution is obtained, when the underlying sequence is i.i.d. or first-order stationary Markov over a finite alphabet.
The Power of Detecting Enriched Patterns: An HMM Approach
- Biology | J. Comput. Biol.
- 2010
The issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif, is addressed.
Approximate word matches between two random sequences
- Mathematics
- 2008
Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for…
On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown
- Mathematics
- 1967
The standard tables used for the Kolmogorov-Smirnov test are valid when testing whether a set of observations are from a completely-specified continuous distribution. If one or more…