Prequential plug-in codes that achieve optimal redundancy rates even if the model is wrong

@article{Grnwald2010PrequentialPC,
  title={Prequential plug-in codes that achieve optimal redundancy rates even if the model is wrong},
  author={Peter Gr{\"u}nwald and Wojciech Kot{\l}owski},
  journal={2010 IEEE International Symposium on Information Theory},
  year={2010},
  pages={1383-1387}
}
We analyse the prequential plug-in codes relative to one-parameter exponential families M. We show that if data are sampled i.i.d. from some distribution outside M, then the redundancy of any plug-in prequential code grows at a rate larger than (1/2) ln n in the worst case. This means that plug-in codes, such as the Rissanen-Dawid ML code, may be inferior to other important universal codes such as the 2-part MDL, Shtarkov and Bayes codes, for which the redundancy is always (1/2) ln n + O(1). However… 
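
To make the rate gap concrete, here is a minimal simulation sketch (an illustration under assumed settings, not code from the paper): take M to be the normal location family with unit variance and sample the data i.i.d. from a normal with variance 4, so that the variance ratio discussed in the references below is c = 4. The plug-in (ML) code's cumulative redundancy with respect to the KL-closest element of M is then expected to grow roughly like (c/2) ln n = 2 ln n, rather than the optimal (1/2) ln n.

import numpy as np

# Illustrative setup (assumption): model M = {N(mu, 1) : mu real}, data from N(0, 4).
rng = np.random.default_rng(0)
n, n_runs = 100_000, 20
checkpoints = np.array([1_000, 10_000, 100_000])

def neg_log_density_unit_var(z, mu):
    # -ln of the N(mu, 1) density evaluated at z
    return 0.5 * np.log(2 * np.pi) + 0.5 * (z - mu) ** 2

avg_red = np.zeros(len(checkpoints))
for _ in range(n_runs):
    x = rng.normal(loc=0.0, scale=2.0, size=n)  # true variance 4, i.e. outside M
    # Prequential plug-in (ML) code: predict x_i with the running mean of x_1..x_{i-1}
    # (started at mu_hat_0 = 0 by convention).
    mu_hat = np.concatenate(([0.0], np.cumsum(x)[:-1] / np.arange(1, n)))
    plugin_len = neg_log_density_unit_var(x, mu_hat).cumsum()
    # Code length assigned by the KL-closest element of M, here N(0, 1).
    best_len = neg_log_density_unit_var(x, 0.0).cumsum()
    avg_red += (plugin_len - best_len)[checkpoints - 1] / n_runs

for m, r in zip(checkpoints, avg_red):
    print(f"n={m:>7d}  avg redundancy={r:7.2f}  "
          f"(c/2) ln n={2.0 * np.log(m):6.2f}  (1/2) ln n={0.5 * np.log(m):5.2f}")

Averaged over the runs, the redundancy should track 2 ln n fairly closely; a single run is noisier but still clearly exceeds (1/2) ln n for large n.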


When Data Compression and Statistics Disagree: Two Frequentist Challenges for the Minimum Description Length Principle
TLDR
A modification of the standard MDL estimator that has been proposed in the literature, but which goes against its data-compression principles, is discussed, and the basic properties of Rényi's dissimilarity measure for probability distributions are reviewed.
Sequential prediction under log-loss and misspecification
TLDR
Two general results for misspecified regret are shown: the existence and uniqueness of the optimal estimator, and a bound sandwiching the misspecification regret between well-specified regrets with (asymptotically) close hypothesis classes.
Sequential normalized maximum likelihood in log-loss prediction
TLDR
This paper shows that for general exponential families, the regret is bounded by the familiar (k/2) log n and thus optimal up to O(1), and introduces an approximation to SNML, flattened maximum likelihood, which is much easier to compute than SNML itself while retaining the optimal regret under some additional assumptions.
Minimum Description Length Revisited
This is an up-to-date introduction to and overview of the Minimum Description Length (MDL) Principle, a theory of inductive inference that can be applied to general problems in statistics, machine learning…
Maximum Likelihood vs. Sequential Normalized Maximum Likelihood in On-line Density Estimation
TLDR
It is shown for the first time that for general exponential families, the regret is bounded by the familiar (k/2) log n and thus optimal up to O(1), and the relationship to the Bayes strategy with Jeffreys' prior is shown.
Laplace's Rule of Succession in Information Geometry
TLDR
It is proved that, for exponential families of distributions, such Bayesian predictors can be approximated by taking the average of the maximum likelihood predictor and the sequential normalized maximum likelihood predictor from information theory, and that it is possible to approximate Bayesian predictors without the cost of integrating or sampling in parameter space.
Measuring Information Transfer in Neural Networks
TLDR
It is shown that L, the proposed Information Transfer, can be used as a measure of generalizable knowledge in a model or a dataset and can serve as an analytical tool in deep learning.
The Door
TLDR
This work focuses on a door because it links one place to another, and the linking of different places and sharing of places is one of the substantial qualities of network technology.
Following the Flattened Leader
TLDR
A simple “flattening” of the sequential ML and related predictors does achieve the optimal worst-case individual-sequence regret of (k/2) log n + O(1) for k-parameter exponential family models on bounded outcome spaces; for unbounded spaces, the authors provide almost-sure results.

References

SHOWING 1-10 OF 18 REFERENCES
Asymptotic Log-Loss of Prequential Maximum Likelihood Codes
TLDR
It is shown that Dawid-Rissanen prequential maximum likelihood codes behave quite differently from other important universal codes such as the 2-part MDL, Shtarkov and Bayes codes, for which c = 1.
MDL model selection using the ML plug-in code
TLDR
It is found that, in contrast to other important universal codes such as the 2-part MDL, Shtarkov and Bayesian codes where c = 1, here c equals the ratio between the variance of P and the variance of the element of M that is closest to P in KL-divergence.
Robustly Minimax Codes for Universal Data Compression
TLDR
A universal code is proposed which asymptotically achieves the minimax value of the relative redundancy of the modified Jeffreys mixture, which was introduced by Takeuchi and Barron and is minimax for regret.
Universal coding, information, prediction, and estimation
A connection between universal codes and the problems of prediction and statistical estimation is established. A known lower bound for the mean length of universal codes is sharpened and generalized, …
An Empirical Study of MDL Model Selection with Infinite Parametric Complexity
TLDR
A Bayesian model with the improper Jeffreys’ prior is the most dependable in MDL model selection; a restricted NML model performs quite well but it is questionable if the results validate its theoretical motivation.
The Minimum Description Length Principle in Coding and Modeling
TLDR
The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms.
Iterated logarithmic expansions of the pathwise code lengths for exponential families
TLDR
For exponential families the authors obtain pathwise expansions, to the constant order, of the predictive and mixture code lengths used in MDL, which are useful for understanding different MDL forms.
Sequential probability assignment via online convex programming using exponential families
TLDR
An algorithm is presented that does not require computing posterior distributions given all current observations, involves simple primal-dual parameter updates, and achieves minimax per-round regret against slowly varying product distributions with marginals drawn from the same exponential family.