Language trees and zipping.

@article{Benedetto2002LanguageTA,
  title={Language trees and zipping.},
  author={Dario Benedetto and Emanuele Caglioti and Vittorio Loreto},
  journal={Physical Review Letters},
  year={2002},
  volume={88},
  number={4},
  pages={048702}
}
In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the application of the method to linguistically motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification.
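
As a rough illustration of the idea in the abstract, the sketch below estimates how remote a short probe string is from a reference text by counting the extra compressed bytes the probe costs when it is appended to that reference. It uses Python's standard zlib module as a stand-in compressor. The function names, the sample texts, and the choice of zlib are assumptions made for illustration, not the authors' implementation.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed to encode `probe` after `reference`.

    A small value suggests the compressor could reuse patterns learned
    from the reference, i.e. the two strings are "close".
    """
    return compressed_size(reference + probe) - compressed_size(reference)


def closest_reference(references: Mapping[str, bytes], probe: bytes) -> str:
    """Return the name of the reference with the smallest extra cost."""
    return min(references, key=lambda name: extra_cost(references[name], probe))


if __name__ == "__main__":
    # Hypothetical toy corpora; real use would need much longer texts.
    corpora = {
        "english": b"the quick brown fox jumps over the lazy dog " * 20,
        "italian": b"la volpe veloce salta sopra il cane pigro " * 20,
    }
    sample = b"the lazy dog jumps over the quick brown fox"
    for name, text in corpora.items():
        print(name, extra_cost(text, sample))
    print("closest:", closest_reference(corpora, sample))
```

The probe should be short relative to the reference so that the compressor's statistics are dominated by the reference. Note also that zlib only looks back over a 32 KB window, so very long references would need a different compressor or chunking.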


Artificial sequences and complexity measures
TLDR
A class of methods is introduced that relies on data compression techniques to define a measure of remoteness and distance between pairs of character sequences, based on their relative information content.
Dictionary-based methods for information extraction
TLDR
A procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts and some results on self-consistent classification problems are presented.
Automatic language identification using multivariate analysis
TLDR
A language identification system that uses the Multivariate Analysis (MVA) for dimensionality reduction and classification is presented and its performance is compared with existing schemes viz., the N-grams and compression.
Dictionary-based methods for information extraction
TLDR
A general method for information extraction that exploits the features of data compression techniques and a procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts.
Data Compression approach to Information Extraction and Classification
TLDR
A class of general methods for information extraction and automatic categorization exploit the features of data compression techniques in order to define a measure of syntactic remoteness between pairs of sequences of characters based on their relative informatic content.
Automatic Alphabet Recognition
TLDR
A vector-space-based method is developed that creates frequency vectors for each letter of the language and then matches a new document's vectors to the pre-computed templates; it provides an efficient solution to the stated problem in most cases.
Use of Kolmogorov distance identification of web page authorship, topic and domain
TLDR
This work deals with the use of information entropy measures for author identification in online postings and the identification of web pages that are related to each other.
Sublinear growth of information in DNA sequences
  • G. Menconi
  • Computer Science
    Bulletin of mathematical biology
  • 2005
...
...