The similarity metric

@article{Li2004TheSM,
  title={The similarity metric},
  author={Ming Li and Xin Chen and Xin Li and Bin Ma and Paul M. B. Vit{\'a}nyi},
  journal={IEEE Transactions on Information Theory},
  year={2004},
  volume={50},
  pages={3250--3264}
}
  • Ming Li, Xin Chen, Xin Li, Bin Ma, P. Vitányi
  • Published 20 November 2001
  • Computer Science, Biology
  • IEEE Transactions on Information Theory
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance," based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This…
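Because Kolmogorov complexity K is noncomputable, the normalized information distance, max{K(x|y), K(y|x)} / max{K(x), K(y)}, cannot be evaluated directly. A common practical stand-in, developed in the related "Clustering by compression" work, is the normalized compression distance (NCD), which replaces K with the output length of a real compressor. The sketch below is a minimal illustration under assumptions of our own: zlib is chosen only for convenience, and the helper names `compressed_size` and `ncd` are hypothetical, not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude stand-in for K)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. The choice of zlib is an assumption;
    any real compressor only approximates Kolmogorov complexity.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    english = b"the quick brown fox jumps over the lazy dog " * 50
    shuffled = english[::-1]
    # Values near 0 suggest similar inputs; values near 1 suggest unrelated ones.
    print("same text:    ", ncd(english, english))
    print("reversed text:", ncd(english, shuffled))
```

With a real compressor the result is only approximately a metric: NCD(x, x) is typically small but not exactly zero, and values slightly above 1 can occur.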

The Universal Similarity Metric does not detect domain similarity
TLDR
An extensive test of the Universal Similarity Metric using a much larger and representative protein dataset shows that it has less domain discriminant power than any one of the methods considered by Sierk and Pearson.
Universal similarity
  • P. Vitányi
  • Computer Science
    IEEE Information Theory Workshop, 2005.
  • 2005
TLDR
A new area of parameter-free similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction is surveyed, based on compression and Google page counts related to search terms.
Compression ratios based on the Universal Similarity Metric still yield protein distances far from CATH distances
TLDR
An extensive test of the Universal Similarity Metric using a much larger and representative protein dataset shows that Krasnogor-Pelta method has less domain discriminant power than any one of the methods considered by Sierk and Pearson when using these simple contact maps.
Chapter 3 Normalized Information Distance
TLDR
This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations and presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
Normalized Information Distance
TLDR
This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations and presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment
Background: Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently primarily handled using alignments. However, the alignment
Clustering by compression
TLDR
A general mathematical theory of universal similarity is developed and tested on real-world applications in a wide range of fields, including the first completely automatic construction of the phylogeny tree based on whole mitochondrial genomes and a language tree for over 50 Euro-Asian languages.
Similarity of Objects and the Meaning of Words
We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family
Clustering by compression
TLDR
Evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors is reported.

References

Clustering by compression
TLDR
Evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors is reported.
The transformation distance: A dissimilarity measure based on movements of segments
TLDR
An algorithm which computes the transformation distance, which quantifies the dissimilarity between two sequences in terms of segment-based events (without requiring a preliminary identification of genes), and a biological application on Tnt1 tobacco retrotransposon is presented.
Transformation distances: a family of dissimilarity measures based on movements of segments
TLDR
An algorithm is presented which, given two sequences S and T, computes exactly and efficiently the transformation distance from S to T, which is able to account for duplications and translocations that cannot be properly described by sequence alignment.
An information-based sequence distance and its application to whole mitochondrial genome phylogeny
TLDR
A sequence distance that works on unaligned sequences using the information theoretical concept of Kolmogorov complexity and a program to estimate this distance is presented.
Normalized Forms for Two Common Metrics
TLDR
It is demonstrated that two common metrics, symmetric set difference and Euclidean distance, have normalized forms which are nevertheless metrics, and these forms are qualitatively different from their unnormalized counterparts, and are therefore also distinguished from simpler range companded constructions.
A practical algorithm for recovering the best supported edges of an evolutionary tree (extended abstract)
TLDR
A new technique, called hypercleaning, is presented that can be combined with various tree-building algorithms to efficiently reconstruct from sequence data the best supported edges of the evolutionary tree, and incorporates a detailed error model that relates errors in the data to the topology of the evolutionary tree.
Estimating true evolutionary distances between genomes
TLDR
This work presents a new technique called IEBP, for estimating the true evolutionary distance between two genomes, whether signed or unsigned, circular or linear, and for any relative probabilities of rearrangement event classes, which is highly accurate, as the simulation study shows.
Algorithmic clustering of music
We present a method for hierarchical music clustering, based on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is
Logical operations and Kolmogorov complexity. II
  • A. Muchnik, N. Vereshchagin
  • Computer Science, Mathematics
    Proceedings 16th Annual IEEE Conference on Computational Complexity
  • 2001
TLDR
There are two strings, whose mutual information is large but which have no common information in a strong sense, thus solving the problem posed by Muchnik et al. (1999) and an interpretation of both results in terms of Shannon entropy.
Genome phylogeny based on gene content
TLDR
This comprehensive genome phylogeny is independent of phylogenies based on the level of sequence identity of individual genes, and correlates with the standard reference of prokaryotic phylogeny based on sequence similarity of 16S rRNA (ref. 4).