When do finite sample effects significantly affect entropy estimates?

@article{Wit1999WhenDF,
title={When do finite sample effects significantly affect entropy estimates?},
author={Thierry Dudok de Wit},
journal={The European Physical Journal B - Condensed Matter and Complex Systems},
year={1999},
volume={11},
pages={513--516}
}
• T. D. Wit
• Published 30 June 1999
• Computer Science
• The European Physical Journal B - Condensed Matter and Complex Systems
An expression is proposed for determining the error made by neglecting finite sample effects in entropy estimates. It is based on the Ansatz that the ranked distribution of probabilities tends to follow a Zipf scaling.
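The finite sample effect described above can be illustrated numerically. The following is a minimal sketch, not the expression proposed in the paper: it draws samples from an assumed Zipf-ranked distribution and compares the naive plug-in (maximum-likelihood) entropy estimate with the true entropy. The function names, the alphabet size of 50 symbols, the Zipf exponent of 1 and the sample sizes are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def zipf_probabilities(n_symbols, exponent=1.0):
    """Probabilities proportional to 1 / rank**exponent (assumed Zipf scaling)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_symbols + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def shannon_entropy(probs):
    """Shannon entropy in nats of a known probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def plugin_entropy(samples):
    """Naive plug-in estimate: entropy of the observed relative frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mean_plugin_bias(n_symbols, n_samples, exponent=1.0, trials=200, seed=0):
    """Average of (plug-in estimate - true entropy) over repeated random draws."""
    rng = random.Random(seed)
    probs = zipf_probabilities(n_symbols, exponent)
    true_h = shannon_entropy(probs)
    symbols = range(n_symbols)
    estimates = [
        plugin_entropy(rng.choices(symbols, weights=probs, k=n_samples))
        for _ in range(trials)
    ]
    return sum(estimates) / trials - true_h


if __name__ == "__main__":
    # Illustrative settings only: 50 symbols, Zipf exponent 1.
    for n_samples in (30, 300, 3000):
        bias = mean_plugin_bias(50, n_samples)
        print(f"N = {n_samples:5d}  mean bias = {bias:+.4f} nats")
```

Under these assumptions the plug-in estimate falls below the true entropy on average, and the gap narrows as the number of samples grows relative to the number of symbols.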
14 Citations

Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

• Mathematics
NIPS
• 2012
A fixed Dirichlet or Pitman-Yor process prior is shown to imply a narrow prior over H, meaning the prior strongly determines the estimate in the under-sampled regime. A family of continuous mixing measures is therefore defined such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H.

Bayesian entropy estimation for countable discrete distributions

• Mathematics, Computer Science
J. Mach. Learn. Res.
• 2014
This work considers the problem of estimating Shannon's entropy H from discrete data, in cases where the number of possible symbols is unknown or even countably infinite, and derives a family of continuous measures for mixing Pitman-Yor processes to produce an approximately flat prior over H.

Entropy estimates of small data sets

• Computer Science, Mathematics
• 2008
A new ‘balanced estimator’ for entropy functionals (Shannon, Rényi and Tsallis) is presented, specially devised to provide a compromise between low bias and small statistical errors for short data series.

On the similarity of symbol frequency distributions with heavy tails

• Computer Science
ArXiv
• 2015
Across a complete $\alpha$-spectrum of measures, it is found that frequent words change more slowly than less frequent words and that $\alpha=2$ provides the most robust measure to quantify language change.

The organization of intrinsic computation: complexity-entropy diagrams and the diversity of natural information processing.

• Computer Science
Chaos
• 2008
This work uses complexity-entropy diagrams to analyze intrinsic computation in a broad array of deterministic nonlinear and linear stochastic processes, including maps of the interval, cellular automata, and Ising spin systems in one and two dimensions, Markov chains, and probabilistic minimal finite-state machines.

Information Analysis of DNA Sequences

This paper considers entropy as a measure of information, modifying the entropy expression to take into account the varying length of coding and non-coding sequences, and shows that introns carry nearly as much information as exons, disproving the notion that they do not carry any information.

INFORMATION ANALYSIS OF DNA SEQUENCES

• Biology
• 2006
The problem of differentiating the informational content of coding (exons) and noncoding (introns) regions of a DNA sequence is one of the central problems of genomics. The introns are estimated to

An Extensible, Scalable Spark Platform for Alignment-free Genomic Analysis - Version 2

• Computer Science
ArXiv
• 2020
Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a computationally convenient alternative to two and multiple sequence alignments for many genomic,

Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms

• Computer Science
Bioinform.
• 2018
KCH is the first set of MapReduce algorithms able to perform informational and linguistic analysis concurrently on large collections of genomic sequences on a Hadoop cluster, and is a much-needed addition to the growing number of algorithms and tools that use MapReduce for core bioinformatics applications.

Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning

• Biology
Bioinform.
• 2015
This work presents the first analysis on the role of k-mers in the composition of nucleosome enriched and depleted genomic regions (NER and NDR for short) that is exhaustive and within the bounds dictated by the information-theoretic content of the sample sets, and informative for comparative epigenomics.
