Word-length Entropies and Correlations of Natural Language Written Texts

Abstract We study the frequency distributions and correlations of the word lengths of 10 European languages. Our findings indicate that (a) the word-length distribution of short words quantified by the mean value and the entropy distinguishes the Uralic (Finnish) corpus from the others, (b) the tails at long words, manifested in the high-order moments of the distributions, differentiate the Germanic languages (except for English) from the Romanic languages and Greek and (c) the correlations… 
Evaluating the Irregularity of Natural Languages
The results, including those of the multiscale entropy analysis, revealed that real texts have non-trivial structure compared with texts obtained from randomization procedures.
Word-Length Correlations and Memory in Large Texts: A Visibility Network Analysis
This work studies the correlation properties of word lengths in large texts from 30 English-language ebooks from Project Gutenberg using the natural visibility graph (NVG) method, and suggests that word lengths are much more strongly correlated at large distances between words than at short distances.
Recurrence Networks in Natural Languages
The application of a linear discriminant analysis leads to well-separated clusters of language families based on the network-density properties, which show similar average density values among languages that belong to the same linguistic family.
Quantifying Evolution of Short and Long-Range Correlations in Chinese Narrative Texts across 2000 Years
It is speculated that the increase of word length and sentence length in written Chinese may account for this phenomenon, in terms of both the social-cultural aspects and the self-adapting properties of language structures.
Entropy in different text types
The present investigation examines how the unique linguistic profile of different text types is reflected in their respective entropy characteristics, and shows a strikingly similar distribution pattern in Chinese and English for the relative entropy of word-forms and POS-forms at different sentential positions.
A comparative study of power law scaling in large word-length sequences
A study of the correlation of lengths of words in large literary texts is presented. We use the statistical tools based on Allan factor and fractal dimension for estimating the fractal indices
Can the Probability Distribution of Dependency Distance Measure Language Proficiency of Second Language Learners?
This study corroborates that quantitative linguistic methods can be put to good use in second language acquisition research, finding that the Zipf-Alekseev distribution captures well the probability distribution of dependency distance for each grade and for native speakers.
Entropic Analysis of Garhwali Text
In the present study, a systematic statistical analysis has been performed on the words of a continuous Garhwali speech corpus. The Garhwali words in the continuous speech corpus are taken from


Word Length in Portuguese Texts
The hypothesis that word length distributions in texts are not chaotic but abide by specific laws has already been confirmed for many different languages, and it therefore forms the basis for an examination like the present one.
Universal Entropy of Word Ordering Across Linguistic Families
A relative entropy measure is computed to quantify the degree of ordering in word sequences from languages belonging to several linguistic families, indicating that, despite the differences in the structure and vocabulary of the languages analyzed, the impact of word ordering on the structure of language is a statistical linguistic universal.
Quantifying the information in the long-range order of words: Semantic structures and universal linguistic constraints
The study of word length has an almost 150-year long history: it was on August 18, 1851, when Augustus de Morgan, the well-known English mathematician and logician (1806–1871), in a letter to a
Word-Length Distribution in English Press Texts
This study examines if the same mathematical model applies to the distribution of word length in daily and weekly English press texts alike and if there is a difference between them at all.
GreekLex: A lexical database of Modern Greek
GreekLex, a lexical database for Modern Greek, is introduced, which presents collectively for the first time a series of orthographic measures that can be used for psycholinguistic research.
Basic Quantitative Characteristics of the Modern Greek Language Using the Hellenic National Corpus
Modern Greek is one of the least quantitatively studied modern European languages and the goal of this paper is to fill this relative void. We use the Hellenic National Corpus (HNC), which is a