Corpus ID: 6168063

The word entropy of natural languages

Christian Bentz and Dimitrios Alikaniotis
The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. Entropy has been established as a measure of this average uncertainty, also called average information content. Here we use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus, and to estimate word…
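As an illustration of the idea in the abstract, the sketch below estimates unigram word entropy on growing prefixes of a token sequence and reports the first prefix size at which the estimate stops changing by more than a tolerance. It is a minimal sketch under stated assumptions: it uses the plug-in (maximum likelihood) estimator, and the names `word_entropy`, `entropy_curve`, `convergence_point`, and the stopping rule based on `tol` are hypothetical choices for illustration, not the estimators or the convergence criterion used in the paper.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Plug-in (maximum likelihood) unigram entropy, in bits per word."""
    counts = Counter(tokens)
    n = len(tokens)
    # H = -sum_w p(w) * log2 p(w), with p(w) estimated as count(w) / n
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_curve(tokens, step):
    """Entropy estimates for prefixes of length step, 2*step, ..."""
    return [
        (n, word_entropy(tokens[:n]))
        for n in range(step, len(tokens) + 1, step)
    ]


def convergence_point(curve, tol):
    """Assumed rule: first prefix size whose estimate differs from the
    previous one by less than tol; None if that never happens."""
    for (_, prev_h), (n, h) in zip(curve, curve[1:]):
        if abs(h - prev_h) < tol:
            return n
    return None


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the rug " * 50
    tokens = text.split()
    curve = entropy_curve(tokens, step=100)
    print(convergence_point(curve, tol=0.01))
```

The plug-in estimator tends to underestimate entropy on small samples, which is one reason to look at where the curve flattens rather than at a single estimate from a short text.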

A Comparison Between Morphological Complexity Measures: Typological Data vs. Language Corpora

This paper uses human expert judgements from the World Atlas of Language Structures (WALS) to compare them to four quantitative measures automatically calculated from language corpora, and finds strong correlations between all the measures.

Morphological Complexity of Children Narratives in Eight Languages

The aim of this study was to compare the morphological complexity in a corpus representing the language production of younger and older children across different languages. The language samples were

InDetermination: Measuring Uncertainty in Social Science Texts

This paper proposes a method for measuring uncertainty in text with broad applicability to the social sciences and discusses several applications of the proposed approach to problems and techniques in social and political science.

Fast medical concept normalization for biomedical literature based on stack and index optimized self-attention

A hierarchical concept normalization method, named FastMCN, with much lower computational cost and a variant of transformer encoder, named stack and index optimized self-attention (SISA), to improve the efficiency and performance is proposed.

Molecule Generation by Principal Subgraph Mining and Assembling

This paper develops a novel notion, the principal subgraph, which is closely related to the informative patterns within molecules, and a two-step subgraph assembling strategy, which predicts a set of subgraphs in a sequence-wise manner and then assembles all generated subgraphs globally as the output molecule.

Memory and locality in natural language

Thesis: Ph. D. in Cognitive Science, Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, 2017.

Graph Piece: Efficiently Generating High-Quality Molecular Graphs with Substructures

This paper proposes a method to automatically discover common substructures, which are called graph pieces, from given molecular graphs, and presents a graph piece variational autoencoder (GP-VAE) for generating molecular graphs based on graph pieces.

Vocabulary Learning via Optimal Transport for Neural Machine Translation

This paper proposes VOLT, a simple and efficient solution without trial training that beats widely-used vocabularies in diverse scenarios, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation.



Universal Entropy of Word Ordering Across Linguistic Families

A relative entropy measure is computed to quantify the degree of ordering in word sequences from languages belonging to several linguistic families, indicating that, despite the differences in the structure and vocabulary of the languages analyzed, the impact of word ordering on the structure of language is a statistical linguistic universal.

Measuring semantic content in distributional vectors

This paper investigates the hypothesis that semantic content can be computed using the Kullback-Leibler (KL) divergence, an information-theoretic measure of the relative entropy of two distributions, and suggests that this result illustrates the rather ‘intensional’ aspect of distributions.

Complexity and universality in the long-range order of words

It is shown that a direct application of information theory leads to an entropy measure that can quantify semantic structures and extract keywords from linguistic samples, even without prior knowledge of the underlying language.

Automated Multiword Expression Prediction for Grammar Engineering

This paper proposes to semi-automatically detect MWE candidates in texts using error mining techniques and to validate them using a combination of the World Wide Web as a corpus and statistical measures, providing a significant increase in the coverage of these expressions.

Europarl: A Parallel Corpus for Statistical Machine Translation

A corpus of parallel text in 11 languages from the proceedings of the European Parliament is collected, with a focus on its acquisition and its application as training data for statistical machine translation (SMT).

An Estimate of an Upper Bound for the Entropy of English

We present an estimate of an upper bound of 1.75 bits for the entropy of characters in printed English, obtained by constructing a word trigram model and then computing the cross-entropy between this

Discriminative Training and Maximum Entropy Models for Statistical Machine Translation

A framework for statistical machine translation of natural languages based on direct maximum entropy models, which contains the widely used source-channel approach as a special case and shows that a baseline statistical machine translation system is significantly improved using this approach.

Prediction and entropy of printed English

A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends

Lexical typology through similarity semantics: Toward a semantic map of motion verbs

It is argued that the theoretical bases underlying probabilistic semantic maps from exemplar data are the isomorphism hypothesis, similarity semantics, and exemplar semantics (exemplar meaning is more fundamental than abstract concepts).