# Alignment-Free Sequence Comparison (I): Statistics and Power

@article{Reinert2009AlignmentFreeSC,
title={Alignment-Free Sequence Comparison (I): Statistics and Power},
author={Gesine Reinert and David S. H. Chew and Fengzhu Sun and Michael S. Waterman},
journal={Journal of computational biology : a journal of computational molecular cell biology},
year={2009},
volume={16},
number={12},
pages={1615--1634}
}
• Published 1 December 2009
• Mathematics
• Journal of computational biology : a journal of computational molecular cell biology
Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D(2) word count…
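As a rough illustration of the k-tuple comparison described in the abstract, the sketch below computes a plain D2 count: for every word w of length k, the number of occurrences in the first sequence is multiplied by the number of occurrences in the second, and the products are summed. The function names `kmer_counts` and `d2` and the example sequences are assumptions made for illustration; they are not taken from the paper or from any accompanying software.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of matching k-tuple pairs: sum over words w of X_w * Y_w.

    Illustrative sketch only; names and interface are hypothetical.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words that never occur in seq_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "TACGTTAC", 3))
```

Because each word's contribution is a product of raw counts, words that are frequent within a single sequence can dominate the sum, which is the single-sequence noise problem the abstract refers to.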
189 Citations


• Biology
J. Comput. Biol.
• 2010
The power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, is studied, both from a theoretical viewpoint and numerically, providing an easy to use program.
• Mathematics
J. Comput. Biol.
• 2014
This work derives exact formulas for the mean and variance of the D(2) statistic for Markovian sequences of any order, and demonstrates through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest.
• Biology
J. Comput. Biol.
• 2013
The statistic d(s)(2) provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
• Computer Science
ArXiv
• 2021
By concentrating on histogram-based AF functions, this work performs the first coherent and uniform evaluation of the power of those functions, involving also Type I error for completeness, and provides a characterization of those AF functions that is novel and informative.
• Computer Science
Bioinform.
• 2022
By concentrating on a representative set of word-frequency based AF functions, the first coherent and uniform evaluation of the power of those AF functions is performed, involving also Type I error for completeness.
• Biology
• 2011
A new statistic is defined, the weighted word match, which reflects the varying degrees of similarity between pairs of amino acids, and the distribution function for various forms of this statistic for sequences of identically and independently distributed letters is simulated.
A number of alignment-free methods are described that are based on spaced word matches (‘SpaM’), i.e. inexact word matches that may contain mismatches at certain pre-defined positions, and that draw on stochastic models of molecular evolution.
• Computer Science
Briefings Bioinform.
• 2019
It is found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment, and paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences.
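Several of the summaries above mention D2*, which uses centralized counts, and D2S, a self-standardized version. The sketch below shows one way such variants are commonly written, under assumptions that are not stated in this listing: letters are i.i.d. with known probabilities `letter_probs`, counts are centered as X_w minus (n - k + 1) times the word probability p_w, D2* divides each product by sqrt((n - k + 1)(m - k + 1)) * p_w, and D2S divides each product by sqrt of the sum of the two squared centered counts. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def centered_counts(seq, k, letter_probs):
    """Return ({word: (centered count, word probability)}, number of k-tuples).

    Assumes an i.i.d. letter model with the given letter probabilities.
    """
    n_bar = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_bar))
    centered = {}
    for letters in product(sorted(letter_probs), repeat=k):
        word = "".join(letters)
        p_w = prod(letter_probs[c] for c in word)
        centered[word] = (counts[word] - n_bar * p_w, p_w)
    return centered, n_bar


def d2_star(seq_a, seq_b, k, letter_probs):
    """Sum of centered-count products, scaled by sqrt(n_bar * m_bar) * p_w."""
    xa, n_bar = centered_counts(seq_a, k, letter_probs)
    xb, m_bar = centered_counts(seq_b, k, letter_probs)
    scale = sqrt(n_bar * m_bar)
    return sum(xa[w][0] * xb[w][0] / (scale * xa[w][1]) for w in xa)


def d2_s(seq_a, seq_b, k, letter_probs):
    """Sum of centered-count products, each divided by sqrt(x^2 + y^2)."""
    xa, _ = centered_counts(seq_a, k, letter_probs)
    xb, _ = centered_counts(seq_b, k, letter_probs)
    total = 0.0
    for w in xa:
        x, y = xa[w][0], xb[w][0]
        denom = sqrt(x * x + y * y)
        if denom > 0:  # skip words whose centered counts are both zero
            total += x * y / denom
    return total


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(d2_star("ACGTACGT", "TACGTTAC", 2, uniform))
    print(d2_s("ACGTACGT", "TACGTTAC", 2, uniform))
```

In practice the letter probabilities would have to be estimated from data; here they are passed in directly to keep the sketch short.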

## References


• Biology
J. Comput. Biol.
• 2010
The power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, is studied, both from a theoretical viewpoint and numerically, providing an easy to use program.
• Computer Science
Pattern Recognit.
• 2009
• Biology, Computer Science
ISMB/ECCB
• 2007
A new score for alignment-free sequence comparison is described, based on comparing the frequencies of all fixed-length words in the two sequences; it is highly successful in discriminating functionally related regulatory sequences from unrelated sequence pairs.
• Mathematics
Journal of Applied Probability
• 2007
Given two sequences of length n over a finite alphabet A of size |A| = d, the D2 statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics…
• Mathematics
Proceedings of the National Academy of Sciences of the United States of America
• 2002
Using an independence model of DNA sequences, the limiting distributions of D2 are derived by means of the Stein and Chen–Stein methods and three asymptotic regimes are identified, including compound Poisson and normal.
• Biology
BMC Bioinformatics
• 2006
The distribution of the D2 statistic at optimal word sizes is characterized and it is found that the best trade-off between computational efficiency and accuracy is obtained with exact word matches.
• Haiyan Huang
• Mathematics, Computer Science
Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for…