Information Distance in Multiples

@article{Vitanyi2011InformationDI,
  title={Information Distance in Multiples},
  author={Paul M. B. Vit{\'a}nyi},
  journal={IEEE Transactions on Information Theory},
  year={2011},
  volume={57},
  pages={2451--2456}
}
  • P. Vitányi
  • Published 2011
  • Mathematics, Computer Science
  • IEEE Transactions on Information Theory
Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal overlap, additivity and normalized information distance in multiples. We use the theoretical notion of Kolmogorov complexity which for practical purposes is approximated by the length of…
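The pairwise normalized compression distance (NCD) underlying this line of work can be sketched in a few lines. The following is an illustrative sketch, not code from the paper: it uses zlib as the real-world compressor standing in for Kolmogorov complexity, and the test strings are invented for the example.

```python
import zlib
import random

def compressed_len(data: bytes) -> int:
    """Length of the zlib-compressed data, a practical stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Pairwise NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

english = b"the quick brown fox jumps over the lazy dog " * 20
similar = b"the quick brown fox jumps over the lazy cat " * 20
random.seed(0)
noise = bytes(random.getrandbits(8) for _ in range(len(english)))

print(ncd(english, similar))  # small: the two strings share almost all structure
print(ncd(english, noise))    # near 1: nothing shared with incompressible noise
```

Because the compressor only approximates Kolmogorov complexity, values can drift slightly outside [0, 1], but the relative ordering of similar versus unrelated objects is what applications rely on.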
Exact Expression For Information Distance
Information distance can be defined not only between two strings but also in a finite multiset of strings of cardinality greater than two. We give an elementary proof for expressing the information…
Normalized Compression Distance of Multiples
This work proposes an NCD of finite multisets (multiples) of objects that is metric and is needed for many applications, using the length of the compressed version of the file involved, produced by a real-world compression program, to approximate the theoretical notion of Kolmogorov complexity.
Exact Expression For Information Distance
  • P. Vitányi
  • Mathematics, Computer Science
  • IEEE Transactions on Information Theory
  • 2017
The upper bound on the information distance for all multisets is the same as the lower bound for infinitely many multisets of each of infinitely many cardinalities, up to a constant additive term.
Compression-Based Similarity
  • P. Vitányi
  • Mathematics, Computer Science
  • 2011 First International Conference on Data Compression, Communications and Processing
  • 2011
This work considers pair-wise distances for literal objects consisting of finite binary files, taken to contain all of their meaning, like genomes or books, and derives a similarity or relative semantics between names for objects.
Web Similarity
The derivation of the NWD method is based on Kolmogorov complexity; its theory is developed and applications are given.
Normalized Compression Distance of Multisets with Applications
This work proposes an NCD of multisets that is also metric, is superior to the pairwise NCD in accuracy and implementation complexity, and is applied to biological and OCR classification questions that were earlier treated with the pairwise NCD.
Normalized Google Distance of Multisets with Applications
This work proposes an NGD of finite multisets of search terms that is better for many applications and gives a relative semantics shared by a multiset of search terms.
Review of Expressions For Information Distance
A review of results expressing the information distance in a finite multiset of strings of cardinality greater than two in terms of conditional Kolmogorov complexity.
Similarity and denoising
  • P. Vitányi
  • Mathematics, Medicine
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
  • 2013
We can discover the effective similarity among pairs of finite objects and denoise a finite object using the Kolmogorov complexity of these objects. The drawback is that the Kolmogorov complexity is…

References

Showing 1-10 of 78 references
The similarity metric
A new "normalized information distance" is proposed, based on the noncomputable notion of Kolmogorov complexity; it is demonstrated to be a metric and is called the similarity metric.
Clustering by compression
Evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors, is reported.
Information Distance
It is shown that the information distance is a universal cognitive similarity distance; the maximal correlation of the shortest programs involved, the maximal uncorrelation of programs, and the density properties of the discrete metric spaces induced by the information distances are investigated.
The Normalized Compression Distance Is Resistant to Noise
The influence of noise on the clustering of files of different types is explored, finding that the NCD performs well even in the presence of quite high noise levels.
Application of compression-based distance measures to protein sequence classification: a methodological study
Compression-based distance measures performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.
Shared information and program plagiarism detection
A metric based on Kolmogorov complexity is proposed and proven to be universal in measuring the amount of shared information between two computer programs, enabling plagiarism detection; a practical system is designed and implemented that approximates this metric by a heuristic compression algorithm.
Hierarchical Clustering Using Mutual Information
We present a conceptually simple method for hierarchical clustering of data called the mutual information clustering (MIC) algorithm. It uses mutual information (MI) as a similarity measure and exploits…
Language trees and zipping.
A very general method, based on data-compression techniques, for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series, featuring highly accurate results for language recognition, authorship attribution, and language classification.
The Google Similarity Distance
A new theory of similarity between words and phrases based on information distance and Kolmogorov complexity is presented and applied to construct a method to automatically extract the similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts.