Language trees and zipping.

@article{Benedetto2002LanguageTA,
  title={Language trees and zipping.},
  author={Dario Benedetto and Emanuele Caglioti and Vittorio Loreto},
  journal={Physical Review Letters},
  year={2002},
  volume={88},
  number={4},
  pages={048702}
}
In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We apply the method to linguistically motivated problems, obtaining highly accurate results for language recognition, authorship attribution, and language classification.
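
As a rough illustration of the idea in the abstract, the following Python sketch estimates how remote a short probe text is from several reference texts by measuring how many extra compressed bytes the probe costs when appended to each reference. It is a minimal sketch under stated assumptions, not the authors' implementation: `zlib` stands in for whichever compressor is actually used, and the names `compressed_size`, `remoteness`, and `closest_reference` are hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def remoteness(reference: str, probe: str) -> int:
    """Extra compressed bytes needed to encode `probe` after `reference`.

    A small value suggests the compressor could reuse patterns it had
    already seen in the reference, i.e. the two texts are similar.
    """
    ref_bytes = reference.encode("utf-8")
    probe_bytes = probe.encode("utf-8")
    return compressed_size(ref_bytes + probe_bytes) - compressed_size(ref_bytes)


def closest_reference(references: Mapping[str, str], probe: str) -> str:
    """Return the label of the reference text with the smallest remoteness."""
    return min(references, key=lambda label: remoteness(references[label], probe))
```

For language recognition, `references` would map language names to sample texts and `probe` would be the unknown snippet; the label returned is the best guess. Because DEFLATE only looks back 32 KiB, only the tail of a long reference influences the result in this sketch.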

Artificial sequences and complexity measures

TLDR
A class of methods is introduced that relies crucially on data compression techniques to define a measure of remoteness and distance between pairs of character sequences, based on their relative information content.

Dictionary-based methods for information extraction

TLDR
A procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts is presented, along with some results on self-consistent classification problems.

Automatic language identification using multivariate analysis

TLDR
A language identification system that uses multivariate analysis (MVA) for dimensionality reduction and classification is presented, and its performance is compared with two existing schemes, N-grams and compression.

Dictionary-based methods for information extraction

TLDR
A general method for information extraction that exploits the features of data compression techniques is presented, together with a procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts.

Data Compression approach to Information Extraction and Classification

TLDR
A class of general methods for information extraction and automatic categorization exploits the features of data compression techniques to define a measure of syntactic remoteness between pairs of character sequences, based on their relative information content.

Automatic Alphabet Recognition

TLDR
A vector-space-based method is developed that creates frequency vectors for each letter of the language and then matches a new document's vectors to the pre-computed templates; it provides an efficient solution to the stated problem in most cases.

Use of Kolmogorov distance identification of web page authorship, topic and domain

TLDR
This work deals with the use of information entropy measures for author identification in online postings and for the identification of web pages that are related to each other.

Sublinear growth of information in DNA sequences

  • G. Menconi
  • Computer Science
    Bulletin of mathematical biology
  • 2005
...
