Entropy Analysis of Word-Length Series of Natural Language Texts: Effects of Text Language and Genre

@article{Kalimeri2012EntropyAO,
  title={Entropy Analysis of Word-Length Series of Natural Language Texts: Effects of Text Language and Genre},
  author={Maria Kalimeri and Vassilios Constantoudis and Constantinos Papadimitriou and Konstantinos Karamanos and Fotis K. Diakonos and Haris Papageorgiou},
  journal={ArXiv},
  year={2012},
  volume={abs/1401.4205}
}
We estimate the n-gram entropies of natural language texts in the word-length representation and find that these are sensitive to text language and genre. We attribute this sensitivity to changes in the probability distribution of the lengths of single words and emphasize the crucial role of the uniformity of the probabilities of having words with lengths between five and ten. Furthermore, comparison with the entropies of shuffled data reveals the impact of word-length correlations on the estimated n…
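As a rough illustration of the procedure the abstract describes, the sketch below (plain Python; the corpus file name sample.txt is a placeholder) maps a text to its word-length series, estimates n-gram block entropies from empirical n-gram frequencies, and compares them with those of a shuffled copy of the series, which keeps the single-word length distribution but destroys word-length correlations. This is a minimal sketch under those assumptions, not the authors' exact estimator; the paper may use different tokenization or finite-size corrections.

import math
import random
import re
from collections import Counter

def word_lengths(text):
    # Map a text to its word-length series, e.g. 'the cat sat' -> [3, 3, 3].
    return [len(w) for w in re.findall(r"[A-Za-z]+", text)]

def block_entropy(series, n):
    # Shannon entropy (bits) of the empirical distribution of n-grams.
    blocks = [tuple(series[i:i + n]) for i in range(len(series) - n + 1)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

text = open("sample.txt", encoding="utf-8").read()  # placeholder corpus file
series = word_lengths(text)

shuffled = series[:]      # shuffling preserves the single-word length
random.shuffle(shuffled)  # distribution but destroys correlations

for n in range(1, 5):
    print(n, block_entropy(series, n), block_entropy(shuffled, n))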

Citations

Word-length Entropies and Correlations of Natural Language Written Texts
The findings indicate that the word-length distribution at short words, quantified by its mean value and entropy, distinguishes the Uralic (Finnish) corpus from the others, while the tails at long words differentiate the Germanic languages.
The Entropy of Words - Learnability and Expressivity across More than 1000 Languages
The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally.
Word-Length Correlations and Memory in Large Texts: A Visibility Network Analysis
This work studies the correlation properties of word lengths in large texts, 30 ebooks in English from the Gutenberg Project, using the natural visibility graph (NVG) method, and suggests that word lengths are much more strongly correlated at large distances between words than at short ones.
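For concreteness, here is a minimal sketch of the natural visibility graph construction applied to a word-length series (plain Python; the naive O(n^2) loop and the toy sentence are illustrative choices, not the cited study's implementation). Two samples are linked when the straight line joining them passes above every intermediate sample.

def visibility_edges(y):
    # Natural visibility graph: nodes i < j are linked iff every intermediate
    # sample k lies strictly below the line joining (i, y[i]) and (j, y[j]).
    edges = []
    for i in range(len(y) - 1):
        for j in range(i + 1, len(y)):
            if all(y[k] < y[i] + (y[j] - y[i]) * (k - i) / (j - i)
                   for k in range(i + 1, j)):
                edges.append((i, j))
    return edges

# Word-length series of 'the quick brown fox jumps over it'
print(visibility_edges([3, 5, 5, 3, 5, 4, 2]))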
Quantifying Evolution of Short and Long-Range Correlations in Chinese Narrative Texts across 2000 Years
It is speculated that the increase of word length and sentence length in written Chinese may account for this phenomenon, in terms of both socio-cultural aspects and the self-adapting properties of language structures.
A comparative study of power law scaling in large word-length sequences
A study of the correlations between the lengths of words in large literary texts is presented. We use statistical tools based on the Allan factor and fractal dimension to estimate the fractal indices…
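As a hedged sketch of how an Allan-factor analysis of a word-length sequence might look (NumPy; mapping long words to point-process 'events' via a threshold is an assumption made here for illustration, not necessarily the cited study's construction):

import numpy as np

def allan_factor(events, window):
    # A(T) = <(N_{k+1} - N_k)^2> / (2 <N_k>), with N_k the event count in the
    # k-th window of `window` samples; A(T) ~ 1 for a Poisson process and
    # grows as a power law T^alpha when events cluster fractally.
    n_win = len(events) // window
    counts = events[:n_win * window].reshape(n_win, window).sum(axis=1)
    return (np.diff(counts) ** 2).mean() / (2 * counts.mean())

lengths = np.random.randint(1, 12, size=20000)   # stand-in word-length series
events = (lengths > 7).astype(int)               # mark long words as 'events'
for T in (4, 16, 64, 256):
    print(T, allan_factor(events, T))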
Recurrence Networks in Natural Languages
The application of linear discriminant analysis leads to well-separated clusters of language families based on network-density properties, which show similar average density values among languages belonging to the same linguistic family.
Long-range correlations and burstiness in written texts: Universal and language-specific aspects
Recently, methods from the statistical physics of complex systems have been applied successfully to identify universal features in the long-range correlations (LRCs) of written texts. However, in…
APPLICATION OF WORD LENGTH USING THREE DISCRETE DISTRIBUTIONS (A Case Study of Students’ Research Projects)
This paper examined the application of word length using three discrete distributions. The study estimates the word-length frequency distributions of five randomly selected students’ research…
Markovian language model of the DNA and its information content
DNA can indeed be treated as a language, a Markovian language, in which a ‘word’ is an element of a group and whose grammar represents the rules behind the probability of transitions between any two groups.
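Transplanting that idea to word lengths, a minimal sketch of a first-order Markov model and its entropy rate might look as follows (plain Python; the coarse-graining of lengths into 'short'/'long' groups is an illustrative assumption, not a mapping taken from the cited paper):

import math
from collections import Counter

def markov_entropy_rate(seq):
    # h = -sum_i pi_i sum_j P(j|i) log2 P(j|i), with pi and P estimated from
    # empirical counts of consecutive symbol pairs.
    pairs = list(zip(seq, seq[1:]))
    trans = Counter(pairs)
    out = Counter(a for a, _ in pairs)   # outgoing transitions per state
    h = 0.0
    for (a, b), c in trans.items():
        p = c / out[a]                   # conditional probability P(b | a)
        h -= (out[a] / len(pairs)) * p * math.log2(p)
    return h

lengths = [3, 5, 5, 3, 5, 4, 2, 8, 3, 6, 4, 7, 2, 5]
groups = ["S" if n <= 4 else "L" for n in lengths]   # coarse-grain lengths
print(markov_entropy_rate(groups))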
…

References

Long-range fractal correlations in literary corpora
The results confirm that, beyond the short-range correlations resulting from syntactic rules acting at the sentence level, long-range structures emerge in large written language samples and give rise to long-range correlations in the use of words.
Prediction and entropy of printed English
A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends…
Foundations of statistical natural language processing
This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as a detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Entropy of symbolic sequences: the role of correlations
It is shown that sporadic systems give rise to peculiar scaling properties as a result of long-range correlations, and the potential implications of this possibility for the structure of natural languages are explored.
Language time series analysis
Entropy analysis of substitutive sequences revisited
This paper proposes a simple criterion, based on measuring block entropies by lumping, that is satisfied by all automatic sequences, and establishes new entropic decimation schemes for the Thue–Morse, Rudin–Shapiro and paperfolding sequences read by lumping.
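To make the reading mode concrete, here is a small sketch (plain Python, written for this page rather than taken from the cited paper) that computes block entropies of the Thue–Morse sequence read by lumping, i.e. over consecutive non-overlapping n-blocks rather than the usual gliding (overlapping) window:

import math
from collections import Counter

def thue_morse(n_doublings):
    # Build the first 2**n_doublings symbols by repeated bitwise complement.
    s = [0]
    for _ in range(n_doublings):
        s += [1 - x for x in s]
    return s

def block_entropy_lumping(seq, n):
    # H(n) read by lumping: partition the sequence into consecutive
    # non-overlapping n-blocks starting at positions 0, n, 2n, ...
    blocks = [tuple(seq[i:i + n]) for i in range(0, len(seq) - n + 1, n)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

tm = thue_morse(14)            # 16384 symbols
for n in (1, 2, 4, 8):
    print(n, block_entropy_lumping(tm, n))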
…