Approaches to the classification of complex systems: Words, texts, and more

@article{Rovenchak2022ApproachesTT,
  title={Approaches to the classification of complex systems: Words, texts, and more},
  author={Andrij A. Rovenchak},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.04060}
}
The chapter starts with introductory information about notions of quantitative linguistics, such as the rank–frequency dependence, Zipf’s law, frequency spectra, etc. Similarities between the distributions of words in texts and the occupation of levels in quantum ensembles hint at a superficial analogy with statistical physics. This analogy enables one to define various parameters for texts, including “temperature”, “chemical potential”, entropy, and some others. Such parameters provide a set of…
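As a minimal, hypothetical illustration of the rank–frequency notions mentioned above (this is not code from the chapter), the following Python sketch builds the rank–frequency distribution of words in a plain-text file and estimates the Zipf exponent α in f(r) ∝ r^(−α) by a least-squares fit in log–log coordinates; the file name sample_text.txt is a placeholder.

# Illustration only: rank-frequency distribution and a rough Zipf-exponent estimate.
import re
from collections import Counter

import numpy as np


def rank_frequency(text: str) -> np.ndarray:
    """Return word frequencies sorted in descending order (rank 1, 2, ...)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return np.array(sorted(counts.values(), reverse=True), dtype=float)


def zipf_exponent(freqs: np.ndarray) -> float:
    """Estimate alpha from the linear fit log f(r) = log C - alpha * log r."""
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope


if __name__ == "__main__":
    # "sample_text.txt" is a placeholder for any plain-text corpus file.
    sample = open("sample_text.txt", encoding="utf-8").read()
    f = rank_frequency(sample)
    print(f"Vocabulary size: {len(f)}, Zipf exponent alpha ~ {zipf_exponent(f):.2f}")

For sufficiently long natural-language texts, the fitted exponent typically comes out close to 1, the classical Zipf value.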

References

Showing 1–10 of 171 references
Zipf’s word frequency law in natural language: A critical review and future directions
It is shown that human language has a highly complex, reliable structure in the frequency distribution over and above Zipf’s law, although prior data visualization methods have obscured this fact.
On the similarity of symbol frequency distributions with heavy tails
Using a complete $\alpha$-spectrum of similarity measures, it is found that frequent words change more slowly than less frequent words and that $\alpha=2$ provides the most robust measure to quantify language change.
Part-of-Speech Sequences in Literary Text: Evidence From Ukrainian
It is shown that Zipf’s law holds for parts-of-speech sequences in Ukrainian texts by Ivan Franko, and it is expected that further studies of the proposed PoSW units both in Ukrainian and other languages can reveal new features of texts on the sentence and supra-sentence levels.
Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics.
It is found that for vertebrates, such as primates and rodents, and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced, so the results of the analyses of these sequences are less conclusive.
Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution
It is suggested that Zipf's law might in fact be a fundamental law in natural languages, because it is demonstrated that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text.
Two Regimes in the Frequency of Words and the Origins of Complex Lexicons: Zipf’s Law Revisited
It is made evident that word frequency as a function of the rank follows two different exponents, ≈ −1 for the first regime and ≈ −2 for the second.
Defining Thermodynamic Parameters for Texts from Word Rank–Frequency Distributions
We report the results regarding the calculation of a new parameter set obtained from the rank–frequency distribution of texts. The parameters are defined using the analogy between the rank–frequency distribution and …
Zipf's Law and Random Texts
It is shown that real texts fill the lexical spectrum much more efficiently, regardless of word length, suggesting that Zipf's law is highly meaningful.
On the Verge of Life: Distribution of Nucleotide Sequences in Viral RNAs
It is observed that the proximity of viruses on planes spanned by various pairs of parameters corresponds, in certain cases, to related species, which argues for expanding the set of parameters used in the classification of viruses.