# When do finite sample effects significantly affect entropy estimates?

@article{Wit1999WhenDF, title={When do finite sample effects significantly affect entropy estimates?}, author={Thierry Dudok de Wit}, journal={The European Physical Journal B - Condensed Matter and Complex Systems}, year={1999}, volume={11}, pages={513-516} }

An expression is proposed for determining the error made by neglecting finite sample effects in entropy estimates. It is based on the Ansatz that the ranked distribution of probabilities tends to follow a Zipf scaling.
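The finite sample effect the abstract refers to can be illustrated numerically. The sketch below is not the expression proposed in the paper; it only draws samples from a Zipf-distributed alphabet and compares the naive plug-in entropy with the exact value. The alphabet size (1000), the Zipf exponent (1.0), the sample sizes, and all function names are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def zipf_probabilities(num_symbols, exponent=1.0):
    """Return p_k proportional to k**(-exponent) for ranks k = 1..num_symbols."""
    weights = [k ** (-exponent) for k in range(1, num_symbols + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def shannon_entropy(probabilities):
    """Exact Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def plugin_entropy(samples):
    """Naive estimate: entropy of the observed relative frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mean_plugin_entropy(probabilities, sample_size, trials, rng):
    """Average the plug-in estimate over several independent samples."""
    symbols = range(len(probabilities))
    total = 0.0
    for _ in range(trials):
        samples = rng.choices(symbols, weights=probabilities, k=sample_size)
        total += plugin_entropy(samples)
    return total / trials


# Assumed setup: 1000 symbols, Zipf exponent 1, fixed seed for repeatability.
rng = random.Random(0)
probs = zipf_probabilities(num_symbols=1000, exponent=1.0)
true_h = shannon_entropy(probs)

for n in (100, 1000, 10000):
    est = mean_plugin_entropy(probs, n, trials=20, rng=rng)
    print(f"N={n:6d}  plug-in={est:.3f}  true={true_h:.3f}  bias={est - true_h:+.3f}")
```

With only 100 samples the plug-in estimate cannot exceed log(100), which is below the exact entropy of this distribution, so the underestimate is guaranteed in that case; the printed bias shows how the gap changes as the sample grows.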

## 14 Citations

### Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

- Mathematics
- NIPS
- 2012

Dirichlet and Pitman-Yor process priors with fixed parameters imply narrow priors over H, meaning the prior strongly determines the estimate in the under-sampled regime. A family of continuous mixing measures is therefore defined such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H.

### Bayesian entropy estimation for countable discrete distributions

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2014

This work considers the problem of estimating Shannon's entropy H from discrete data, in cases where the number of possible symbols is unknown or even countably infinite, and derives a family of continuous measures for mixing Pitman-Yor processes to produce an approximately flat prior over H.

### Entropy estimates of small data sets

- Computer Science, Mathematics
- 2008

A new ‘balanced estimator’ for entropy functionals (Shannon, Rényi and Tsallis) is proposed, specially devised to provide a compromise between low bias and small statistical errors for short data series.

### On the similarity of symbol frequency distributions with heavy tails

- Computer Science
- ArXiv
- 2015

Using a complete $\alpha$-spectrum of measures to quantify language change, it is found that frequent words change more slowly than less frequent words and that $\alpha=2$ provides the most robust measure.

### The organization of intrinsic computation: complexity-entropy diagrams and the diversity of natural information processing.

- Computer Science
- Chaos
- 2008

This work uses complexity-entropy diagrams to analyze intrinsic computation in a broad array of deterministic nonlinear and linear stochastic processes, including maps of the interval, cellular automata, and Ising spin systems in one and two dimensions, Markov chains, and probabilistic minimal finite-state machines.

### Information Analysis of DNA Sequences

- Biology
- ArXiv
- 2010

This paper considers entropy as a measure of information by modifying the entropy expression to take into account the varying length of coding and non-coding sequences, and shows that introns carry nearly as much information as exons, disproving the notion that they do not carry any information.

### INFORMATION ANALYSIS OF DNA SEQUENCES

- Biology
- 2006

The problem of differentiating the informational content of coding (exons) and noncoding (introns) regions of a DNA sequence is one of the central problems of genomics. The introns are estimated to…

### An Extensible, Scalable Spark Platform for Alignment-free Genomic Analysis - Version 2

- Computer Science
- ArXiv
- 2020

Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a computationally convenient alternative to two and multiple sequence alignments for many genomic,…

### Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms

- Computer Science
- Bioinform.
- 2018

KCH is the first set of MapReduce algorithms able to perform informational and linguistic analysis of large collections of genomic sequences concurrently on a Hadoop cluster, and is a much-needed addition to the growing number of algorithms and tools that use MapReduce for core bioinformatics applications.

### Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning

- Biology
- Bioinform.
- 2015

This work presents the first analysis of the role of k-mers in the composition of nucleosome-enriched and nucleosome-depleted genomic regions (NER and NDR for short) that is exhaustive, within the bounds dictated by the information-theoretic content of the sample sets, and informative for comparative epigenomics.
