• Corpus ID: 16286225

Compression and the origins of Zipf's law of abbreviation

Ramon Ferrer-i-Cancho, Christian Bentz, Caio Seguin
Languages across the world exhibit Zipf's law of abbreviation, namely, more frequent words tend to be shorter. The generalized version of the law - an inverse relationship between the frequency of a unit and its magnitude - also holds for the behaviours of other species and the genetic code. The apparent universality of this pattern in human language and its ubiquity in other domains call for a theoretical understanding of its origins. To this end, we generalize the information theoretic…

Compression and the origins of Zipf's law for word frequencies

A new derivation of Zipf's law for word frequencies based on optimal coding that sheds light on the origins of other statistical laws of language and thus can lead to a compact theory of linguistic laws.

Zipf's law of abbreviation as a language universal

It is argued that this universal trend, in which words that are used more frequently tend to be shorter, is likely to derive from fundamental principles of information processing and transfer.

The Brevity Law as a Scaling Law, and a Possible Origin of Zipf’s Law for Word Frequencies

A new perspective to establish a connection between different statistical linguistic laws is presented, and a possible model-free explanation for the origin of Zipf's law is found, which should arise as a mixture of conditional frequency distributions governed by the crossover length-dependent frequency.

The evolution of optimized language in the light of standard information theory

Extensions of standard information theory predict that in case of optimal coding, the correlation between word frequency and word length cannot be positive and, in general, it is expected to be negative in concordance with Zipf’s law of abbreviation.

The Entropy of Words - Learnability and Expressivity across More than 1000 Languages

The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally.

Linguistic laws in chimpanzee gestural communication

A negative correlation between number and mean duration of gestures in sequences, in line with Menzerath's law, is found, providing the first evidence that compression underpins animal gestural communication and highlighting an important commonality between primate gesturing and language.

Gelada vocal sequences follow Menzerath’s linguistic law

In vocal sequences of wild male geladas (Theropithecus gelada), construct size is negatively correlated with constituent size (duration of calls) and formal mathematical support is provided for the idea that Menzerath’s law reflects compression—the principle of minimizing the expected length of a code.

The placement of the head that maximizes predictability. An information theoretic approach

From the perspective of information theory, this paper adds a word order principle, the maximization of the predictability of a target element, that competes with the minimization of the length of syntactic dependencies.

Parallels of human language in the behavior of bottlenose dolphins

Dolphins exhibit striking similarities with humans, and various statistical laws of language that are well-known in quantitative linguistics, i.e. Zipf’s law for word frequencies, the law of meaning distribution, and Menzerath's law, have been found in dolphin vocal or gestural behavior.



Two Regimes in the Frequency of Words and the Origins of Complex Lexicons: Zipf’s Law Revisited*

It is made evident that word frequency as a function of the rank follows two different exponents, approximately -1 for the first regime and approximately -2 for the second.

Languages cool as they expand: Allometric scaling and the decreasing need for new words

The annual growth fluctuations of word use have a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion.

Compression as a Universal Principle of Animal Behavior

It is shown that minimizing the expected code length implies that the length of a word cannot increase as its frequency increases, which means that the mean code length or duration is significantly small in human language, and also in the behavior of other species in all cases where agreement with the law of brevity has been found.
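
The claim that minimizing expected code length rules out longer codes for more frequent units can be illustrated with a toy check. The sketch below is not taken from the paper: the probabilities, the lengths, and the function names `expected_length` and `best_assignment` are hypothetical, chosen only to show the idea by brute force.

```python
from itertools import permutations

def expected_length(probs, lengths):
    """Mean code length: sum of p_i * l_i over all units."""
    return sum(p * l for p, l in zip(probs, lengths))

def best_assignment(probs, lengths):
    """Try every way of assigning the given lengths to the units and
    return the assignment with the smallest expected length."""
    return min(permutations(lengths),
               key=lambda perm: expected_length(probs, perm))

# Hypothetical toy data: units sorted by decreasing probability.
probs = [0.5, 0.25, 0.15, 0.10]
lengths = [4, 3, 2, 1]  # available code lengths, in arbitrary order

best = best_assignment(probs, lengths)
# In the optimal assignment, length never decreases as probability falls.
is_monotone = all(a <= b for a, b in zip(best, best[1:]))
print(best, expected_length(probs, best), is_monotone)
```

In this toy case the cheapest assignment gives the shortest length to the most probable unit, which is the pattern the law of brevity describes. Brute force over permutations is only practical for a handful of units; it is used here for transparency, not efficiency.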

Zipf's Law and Random Texts

It is shown that real texts fill the lexical spectrum much more efficiently and regardless of the word length, suggesting that the meaningfulness of Zipf's law is high.

Zipf's law from a communicative phase transition

It is supported that Zipf's law in a communication system may maximize the information transfer under constraints and be especially suitable for the speech of schizophrenics.

Random texts exhibit Zipf's-law-like word frequency distribution

It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf's law observed in natural languages such as English. The facts that the frequency of

Rank Diversity of Languages: Generic Behavior in Computational Linguistics

A measure of how word ranks change in time is introduced and this diversity is calculated for books published in six European languages since 1800, and it is found that it follows a universal lognormal distribution.

Word lengths are optimized for efficient communication

It is shown across 10 languages that average information content is a much better predictor of word length than frequency, which indicates that human lexicons are efficiently structured for communication by taking into account interword statistical dependencies.

Word Length and Word Frequency

Since the appearance of Zipf’s works, his hypothesis “that the magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship to the number of occurrences” has been generally accepted.

Statistical laws in linguistics

It is argued that linguistic laws are only meaningful if accompanied by a model for which the fluctuations can be computed (e.g., a generative model of the text) and the constraints imposed by linguistic laws on the creativity process of text generation are not as tight as one could expect.