The frequency spectrum of finite samples from the intermittent silence process

@article{FerreriCancho2009TheFS,
  title={The frequency spectrum of finite samples from the intermittent silence process},
  author={Ramon Ferrer-i-Cancho and Ricard Gavald{\`a}},
  journal={J. Assoc. Inf. Sci. Technol.},
  year={2009},
  volume={60},
  pages={837--843}
}
It has been argued that the actual distribution of word frequencies could be reproduced or explained by generating a random sequence of letters and spaces according to the so-called intermittent silence process. The same kind of process could reproduce or explain the counts of other kinds of units from a wide range of disciplines. Taking the linguistic metaphor, we focus on the frequency spectrum, i.e., the number of words with a certain frequency, and the vocabulary size, i.e., the number of… 
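
As a rough illustration of the quantities named in the abstract, the following minimal sketch simulates a simple variant of the process and computes the frequency spectrum and the vocabulary size of a finite sample. It is not the authors' code. The function names, the two-letter alphabet, the space probability `p_space`, the equal letter probabilities, and the decision to discard empty words produced by consecutive spaces are all assumptions made for this example.

```python
import random
from collections import Counter


def intermittent_silence(n_chars, alphabet="ab", p_space=0.2, seed=0):
    """Generate n_chars characters: a space with probability p_space,
    otherwise a letter drawn uniformly from alphabet (an assumption)."""
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    return "".join(chars)


def frequency_spectrum(text):
    """Return (word_counts, spectrum), where spectrum[f] is the number of
    distinct words that occur exactly f times in the sample."""
    # str.split() with no argument drops empty words from repeated spaces
    # (an assumption of this sketch, not necessarily of the paper).
    word_counts = Counter(text.split())
    spectrum = Counter(word_counts.values())
    return word_counts, spectrum


text = intermittent_silence(10_000)
word_counts, spectrum = frequency_spectrum(text)
vocabulary_size = len(word_counts)  # number of distinct words in the sample
```

Summing `spectrum` over all frequencies recovers `vocabulary_size`, and summing `f * spectrum[f]` recovers the total number of word tokens, which gives a quick consistency check on any sample.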

Compression and the origins of Zipf's law for word frequencies
TLDR
A new derivation of Zipf's law for word frequencies based on optimal coding that sheds light on the origins of other statistical laws of language and thus can lead to a compact theory of linguistic laws.
Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts
TLDR
It is concluded that the exponents of Zipf’s law are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies.
The origins of Zipf's meaning‐frequency law
TLDR
It is shown that a single assumption on the joint probability of a word and a meaning suffices to infer Zipf's meaning‐frequency law or relaxed versions, and can be justified as the outcome of a biased random walk in the process of mental exploration.
Zipf's law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort.
The ubiquitous inverse relationship between word frequency and word rank is commonly known as Zipf's law. The theoretical underpinning of this law states that the inverse relationship yields
Optimization Models of Natural Communication
TLDR
Two important components of the family, namely the information theoretic principles and the energy function that combines them linearly, are reviewed from the perspective of psycholinguistics, language learning, information theory and synergetic linguistics.
Compression and the origins of Zipf's law of abbreviation
TLDR
This work generalizes the information theoretic concept of mean code length as a mean energetic cost function over the probability and the magnitude of the types of the repertoire and shows that the minimization of that cost function and a negative correlation between probability and the magnitude of types are intimately related.
Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution
TLDR
It is suggested that Zipf's law might in fact be a fundamental law in natural languages because it is demonstrated that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text.
Information content versus word length in random typing
TLDR
The relationship between the measure and word length is studied for the popular random typing process where a text is constructed by pressing keys at random from a keyboard containing letters and a space behaving as a word delimiter.
A paradoxical property of the monkey book
TLDR
The somewhat counter-intuitive conclusion is that a 'monkey book' obeys Heaps' power law precisely because its word-frequency distribution is not a smooth power law, contrary to the expectation based on simple mathematical arguments that if one is a power law, so is the other.
Compression as a Universal Principle of Animal Behavior
TLDR
It is shown that minimizing the expected code length implies that the length of a word cannot increase as its frequency increases, which means that the mean code length or duration is significantly small in human language, and also in the behavior of other species in all cases where agreement with the law of brevity has been found.

References

Numerical Analysis of Word Frequencies in Artificial and Natural Language Texts
We perform a numerical study of the statistical properties of natural texts written in English and of two types of artificial texts. As statistical tools we use the conventional Zipf analysis of the
Zipf's Law and Random Texts
TLDR
It is shown that real texts fill the lexical spectrum much more efficiently and regardless of the word length, suggesting that the meaningfulness of Zipf's law is high.
Zipf's law from a communicative phase transition
TLDR
It is supported that Zipf's law in a communication system may maximize the information transfer under constraints and be specially suitable for the speech of schizophrenics.
Hierarchical structures induce long-range dynamical correlations in written texts.
TLDR
It is concluded that hierarchical structures in text serve to create long-range correlations, and use the reader's memory in reenacting some of the multidimensionality of the thoughts being expressed.
The appropriate use of Zipf's law in animal communication studies
On the law of Zipf-Mandelbrot for multi-word phrases
This article studies the probabilities of the occurrence of multi-word (m-word) phrases (m = 2,3,... ) in relation to the probabilities of occurrence of the single words. It is well known that, in
Random texts exhibit Zipf's-law-like word frequency distribution
It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf's law observed in natural languages such as English. The facts that the frequency of
The use of Zipf's law in animal communication analysis
Finitary models of language users
TLDR
It is proposed to describe talkers and listeners to describe the users of language rather than the language itself, just as the authors' knowledge of arithmetic is not merely the collection of their arithmetic responses, habits, or dispositions.
The Frequency Spectrum of Text and Vocabulary
TLDR
Some problems of the analysis of the word‐frequency distribution and the possibility of its analytical description are dealt with.