Rudi Cilibrasi

Learn More
Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of "society" is "database," and the equivalent of "use" is "a way to search the database". We present a new theory of similarity between words and phrases based on information distance and(More)
We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise(More)
We present a new theory of relative semantics between objects, based on information distance and Kolmogorov complexity. This theory is then applied to construct a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of(More)
Cilibrasi, Vitányi, and de Wolf Computer Music Journal, 28:4, pp. 49–67, Winter 2004 2004 Massachusetts Institute of Technology Rudi Cilibrasi,* Paul Vitányi,*† and Ronald de Wolf* *Centrum voor Wiskunde en Informatica Kruislaan 413 1098 SJ Amsterdam, The Netherlands †Institute for Logic, Language, and Computation University of Amsterdam Plantage(More)
The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts,(More)
In this paper we present a collection of results pertaining to haplotyping. The first set of results concerns the combinatorial problem of reconstructing haplotypes from incomplete and/or imperfectly sequenced haplotype data. More specifically, we show that an interesting, restricted case of Minimum Error Correction (MEC) is NP-hard, point out problems in(More)
We present several new results pertaining to haplotyping. These results concern the combinatorial problem of reconstructing haplotypes from incomplete and/or imperfectly sequenced haplotype fragments. We consider the complexity of the problems Minimum Error Correction (MEC) and Longest Haplotype Reconstruction (LHR) for different restrictions on the input(More)
We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like "red" or "Christianity". For the first type we consider a family of computable distance measures(More)