# Language trees and zipping.

@article{Benedetto2002LanguageTA, title={Language trees and zipping.}, author={Dario Benedetto and Emanuele Caglioti and Vittorio Loreto}, journal={Physical Review Letters}, year={2002}, volume={88}, number={4}, pages={048702} }

In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present an application of the method to linguistically motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification.
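The idea of a compression-based remoteness measure can be illustrated with a minimal sketch. It assumes `zlib` as a stand-in compressor, and the helper names (`compressed_size`, `remoteness`, `closest_reference`) are hypothetical, not taken from the Letter. The remoteness of a short sample from a reference text is estimated as the number of extra compressed bytes needed when the sample is appended to the reference; a sample is then attributed to the reference that makes it cheapest to encode.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of `data`."""
    return len(zlib.compress(data, 9))


def remoteness(reference: bytes, sample: bytes) -> int:
    """Extra compressed bytes needed to encode `sample` after `reference`.

    A small value suggests the compressor could reuse patterns learned
    from `reference`, i.e. the two texts are statistically close.
    """
    return compressed_size(reference + sample) - compressed_size(reference)


def closest_reference(references: Mapping[str, bytes], sample: bytes) -> str:
    """Label of the reference text with the smallest remoteness to `sample`."""
    return min(references, key=lambda name: remoteness(references[name], sample))


if __name__ == "__main__":
    # Toy reference texts (illustrative only, far too short for real use).
    references = {
        "english": b"the quick brown fox jumps over the lazy dog "
                   b"and runs through the quiet forest near the river ",
        "italian": b"la volpe veloce salta sopra il cane pigro "
                   b"e corre attraverso la foresta silenziosa vicino al fiume ",
    }
    sample = b"the lazy dog and runs through the quiet forest"
    print(closest_reference(references, sample))
```

In practice the reference texts would need to be long and the sample short relative to them, so that the compressor's model is dominated by the reference; zlib's 32 KB window also limits how much of the reference it can exploit.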

## 196 Citations

### Artificial sequences and complexity measures

- Computer Science
- 2005

A class of methods is introduced that relies crucially on data-compression techniques to define a measure of remoteness and distance between pairs of character sequences, based on their relative information content.

### Dictionary-based methods for information extraction

- Computer Science
- 2004

A procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts is presented, along with some results on self-consistent classification problems.

### Automatic Language Identification Using Multivariate Analysis

- Computer Science
- CICLing
- 2005

A language identification system that uses the Multivariate Analysis (MVA) for dimensionality reduction and classification is presented and its performance is compared with existing schemes viz., the N-grams and compression.

### Dictionary-based methods for information extraction

- Computer Science
- 2004

A general method for information extraction that exploits the features of data compression techniques and a procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts.

### Data Compression approach to Information Extraction and Classification

- Computer Science
- ArXiv
- 2004

A class of general methods for information extraction and automatic categorization exploit the features of data compression techniques in order to define a measure of syntactic remoteness between pairs of sequences of characters based on their relative informatic content.

### Automatic Alphabet Recognition

- Computer Science
- Information Retrieval
- 2004

A vector-space-based method is developed that creates a frequency vector for each letter of the language and then matches a new document's vectors to the pre-computed templates; it provides an efficient solution to the stated problem in most cases.

### Use of Kolmogorov distance identification of web page authorship, topic and domain

- Computer Science
- 2005

This work deals with the use of information entropy measures for author identification in online postings and for identifying web pages that are related to each other.

### Sublinear growth of information in DNA sequences

- Computer Science
- Bulletin of mathematical biology
- 2005

## References

### Typical sequences and all that: entropy, pattern matching, and data compression

- Physics
- Proceedings of 1994 IEEE International Symposium on Information Theory
- 1994

The author applies pattern matching results to three problems in information theory. The characterisation of a probability law is also discussed.

### Book Review: An introduction to Kolmogorov Complexity and its Applications Second Edition, 1997 by Ming Li and Paul Vitanyi (Springer (Graduate Text Series))

- Mathematics
- SIGACT News
- 1997

The complexity of a string x is defined as the length of the shortest description of x, and a formal definition is given that is equivalent to the one in the book.

### Advances in cladistics

- Computer Science
- 1983

### Information and dynamical systems: a concrete measurement on sporadic dynamics

- Computer Science, Physics
- 2002

### Information, Randomness and Incompleteness - Papers on Algorithmic Information Theory; 2nd Edition

- Computer Science
- World Scientific Series in Computer Science
- 1990

The papers gathered in this book were published over a period of more than twenty years in widely scattered journals. They led to the discovery of randomness in arithmetic which was presented in the…

### Complexity, Entropy and the Physics of Information

- Education
- 1990

### The Bell System Technical Journal

- 27, 379 and 623
- 1948