# Calibration tests in multi-class classification: A unifying framework

@article{Widmann2019CalibrationTI, title={Calibration tests in multi-class classification: A unifying framework}, author={David Widmann and Fredrik Lindsten and Dave Zachariah}, journal={ArXiv}, year={2019}, volume={abs/1910.11385} }

In safety-critical applications a probabilistic model is usually required to be calibrated, i.e., to capture the uncertainty of its predictions accurately. In multi-class classification, calibration of the most confident predictions only is often not sufficient. We propose and study calibration measures for multi-class classification that generalize existing measures such as the expected calibration error, the maximum calibration error, and the maximum mean calibration error. We propose and…
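The expected calibration error mentioned above is the most common of the measures this framework generalizes. As a concrete reference point, a minimal NumPy sketch of the binned ECE of the most confident predictions (equal-width bins and the bin count are illustration choices, not prescribed by the paper):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE of the top-class predictions.

    conf    : array of top-class confidences in [0, 1]
    correct : array, 1/True where the top prediction was right
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # right-closed bins; the first bin also includes 0.0
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            # weight each bin's |accuracy - confidence| gap by its share of samples
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
    return ece
```

Replacing the weighted sum with a maximum over bins gives the maximum calibration error (MCE) in the same sketch.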

## Figures and Tables from this paper

Table 1; Figures 1–37.

## 41 Citations

On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers

- Computer Science, ArXiv
- 2022

This work introduces: benchmarking on pseudo-real data where the true calibration map can be estimated very precisely; and novel calibration and evaluation methods using new calibration map families PL and PL3.

Trustworthy Deep Learning via Proper Calibration Errors: A Unifying Approach for Quantifying the Reliability of Predictive Uncertainty

- Computer Science, ArXiv
- 2022

The framework of proper calibration errors is introduced, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties; this allows the calibration improvement of any injective recalibration method to be estimated reliably and without bias.

Top-label calibration

- Computer Science, ArXiv
- 2021

A histogram binning algorithm is formalized that reduces top-label multiclass calibration to the binary case; it is proved to have clean theoretical guarantees without distributional assumptions, and a methodical study of its practical performance is performed.
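The binary histogram binning this reduction relies on can be sketched as follows (a minimal version: equal-width bins and a midpoint fallback for empty bins are illustration choices; the paper itself works with data-dependent bins and formal guarantees):

```python
import numpy as np

def fit_histogram_binning(conf, correct, n_bins=10):
    """Fit histogram binning on a held-out calibration split.

    Returns bin edges and per-bin empirical accuracies; empty bins
    fall back to the bin midpoint.
    """
    conf = np.asarray(conf, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    acc = 0.5 * (edges[:-1] + edges[1:])          # midpoint fallback
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            acc[b] = correct[in_bin].mean()       # empirical accuracy in bin
    return edges, acc

def apply_histogram_binning(conf, edges, acc):
    """Replace each confidence with its bin's empirical accuracy."""
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, len(acc) - 1)
    return acc[idx]
```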

Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning

- Computer Science, ICLR
- 2021

The I-Max concept for binning is derived, which maximizes the mutual information between labels and binned (quantized) logits; it outperforms state-of-the-art calibration methods on various evaluation metrics across different benchmark datasets and models, even when using only a small set of calibration data.

Hidden Heterogeneity: When to Choose Similarity-Based Calibration

- Computer Science, ArXiv
- 2022

HH can serve as a useful diagnostic tool for identifying when local calibration methods are needed; the improvements in calibration achieved by similarity-based calibration methods correlate with the amount of HH present and generally exceed those achieved by global methods.

Calibrate: Interactive Analysis of Probabilistic Model Output

- Computer Science, ArXiv
- 2022

Calibrate constructs a reliability diagram that is resistant to the drawbacks of traditional approaches, allows for interactive subgroup analysis and instance-level inspection, and is validated by a think-aloud experiment with data scientists who routinely analyze model calibration.

Calibration of Neural Networks using Splines

- Computer Science, ICLR
- 2021

This work introduces a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the main idea is to compare the respective cumulative probability distributions.
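The binning-free idea described above can be sketched as a KS-style statistic: with samples sorted by confidence, compare the cumulative observed accuracy against the cumulative predicted confidence and take the maximum gap (a minimal sketch of the idea, not the paper's exact estimator):

```python
import numpy as np

def ks_calibration_error(conf, correct):
    """KS-style, binning-free calibration error: the largest gap between
    cumulative accuracy and cumulative confidence, samples sorted by
    confidence."""
    conf = np.asarray(conf, float)
    correct = np.asarray(correct, float)
    order = np.argsort(conf)
    # cumulative residuals between outcomes and predicted probabilities
    resid = np.cumsum(correct[order] - conf[order])
    return np.max(np.abs(resid)) / len(conf)
```

Because no bins are involved, the statistic avoids the bin-count sensitivity that plagues binned ECE estimates.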

Local Calibration: Metrics and Recalibration

- Computer Science, ArXiv
- 2021

This work proposes the local calibration error (LCE), a novel local recalibration method that can be estimated sample-efficiently from data, and empirically finds that it reveals miscalibration modes that are more fine-grained than the ECE can detect.

Calibrating Predictions to Decisions: A Novel Approach to Multi-Class Calibration

- Computer Science, NeurIPS
- 2021

This work introduces a new notion, decision calibration, that requires the predicted distribution and true distribution to be "indistinguishable" to a set of downstream decision-makers, and designs a recalibration algorithm whose sample complexity is polynomial in the number of actions and the number of classes.

What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability

- Computer Science
- 2022

It is argued that more expressive metrics must be developed that accurately measure calibration error for the specific context in which a classifier will be deployed, and a number of metrics are derived, using a generalization of Expected Calibration Error (ECE), that measure calibration error under different definitions of reliability.

## References

SHOWING 1-10 OF 42 REFERENCES

Evaluating model calibration in classification

- Computer Science, AISTATS
- 2019

This work develops a general theoretical calibration evaluation framework grounded in probability theory, and points out subtleties present in model calibration evaluation that lead to refined interpretations of existing evaluation techniques.

On Calibration of Modern Neural Networks

- Computer Science, ICML
- 2017

It is discovered that modern neural networks, unlike those from a decade ago, are poorly calibrated, and on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
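Temperature scaling, as described above, divides the logits by a single scalar T fitted on a validation split by minimizing negative log-likelihood. A minimal sketch (the paper optimizes T by gradient descent; the grid search here is an illustrative simplification):

```python
import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood of temperature-scaled logits."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.05, 10.0, 400)):
    """Pick the temperature minimizing validation NLL over a grid."""
    return min(grid, key=lambda T: nll(logits, labels, T))
```

T > 1 softens overconfident predictions without changing the predicted class, which is why the method preserves accuracy while improving calibration.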

Obtaining Well Calibrated Probabilities Using Bayesian Binning

- Computer Science, AAAI
- 2015

A new non-parametric calibration method called Bayesian Binning into Quantiles (BBQ) is presented which addresses key limitations of existing calibration methods and can be readily combined with many existing classification algorithms.

Verified Uncertainty Calibration

- Computer Science, NeurIPS
- 2019

The scaling-binning calibrator is introduced, which first fits a parametric function to reduce variance and then bins the function values to actually ensure calibration, and estimates a model's calibration error more accurately using an estimator from the meteorological community.

Trainable Calibration Measures For Neural Networks From Kernel Mean Embeddings

- Computer Science, ICML
- 2018

MMCE is presented, a RKHS kernel based measure of calibration that is efficiently trainable alongside the negative likelihood loss without careful hyperparameter tuning, and whose finite sample estimates are consistent and enjoy fast convergence rates.
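A plug-in estimate of the MMCE idea above can be sketched as a kernel-weighted sum of calibration residuals over pairs of samples (the Laplacian kernel and its width are illustrative choices here, not necessarily the paper's exact settings):

```python
import numpy as np

def mmce(conf, correct, width=0.4):
    """Plug-in MMCE-style estimate: kernel-weighted sum over pairs of
    residuals (outcome - confidence), with a Laplacian kernel on the
    confidences."""
    c = np.asarray(conf, float)
    r = np.asarray(correct, float) - c            # residuals y_i - c_i
    K = np.exp(-np.abs(c[:, None] - c[None, :]) / width)
    total = (r[:, None] * r[None, :] * K).sum()
    return np.sqrt(np.maximum(total, 0.0)) / len(c)
```

Because every term is differentiable in the confidences, a statistic of this form can be added to the training loss, which is what makes the measure trainable.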

A Kernel Two-Sample Test

- Mathematics, Computer Science, J. Mach. Learn. Res.
- 2012

This work proposes a framework for analyzing and comparing distributions, which is used to construct statistical tests to determine if two samples are drawn from different distributions, and presents two distribution free tests based on large deviation bounds for the maximum mean discrepancy (MMD).
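The unbiased squared-MMD estimator at the heart of this test can be sketched directly (Gaussian RBF kernel; the bandwidth parameter is an illustration choice, where the paper also discusses data-driven heuristics):

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of squared MMD with a Gaussian RBF kernel.

    X, Y : (n, d) and (m, d) samples from the two distributions.
    """
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # drop diagonal terms so within-sample averages are unbiased
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()
```

An estimate near zero is consistent with the two samples sharing a distribution; the tests in the paper calibrate a rejection threshold for this statistic.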

Reliability, Sufficiency, and the Decomposition of Proper Scores

- Computer Science
- 2008

It is demonstrated that resolution and reliability are directly related to forecast attributes which are desirable on grounds independent of the notion of scores, which can be considered an epistemological justification of measuring forecast quality by proper scores.

Estimating reliability and resolution of probability forecasts through decomposition of the empirical score

- Environmental Science, Climate Dynamics
- 2011

Proper scoring rules provide a useful means to evaluate probabilistic forecasts. Independent from scoring rules, it has been argued that reliability and resolution are desirable forecast attributes.…

Rademacher and Gaussian Complexities: Risk Bounds and Structural Results

- Computer Science, J. Mach. Learn. Res.
- 2001

This work investigates the use of certain data-dependent estimates of the complexity of a function class called Rademacher and Gaussian complexities and proves general risk bounds in terms of these complexities in a decision theoretic setting.

Increasing the Reliability of Reliability Diagrams

- Computer Science
- 2007

A resampling method for assigning consistency bars to the observed frequencies is introduced that allows for immediate visual evaluation as to just how likely the observed relative frequencies are under the assumption that the predicted probabilities are reliable.
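The resampling idea above can be sketched for a single reliability-diagram bin: under the hypothesis that the predicted probabilities are reliable, simulate each outcome as Bernoulli with its predicted probability and take quantiles of the simulated bin frequency (a minimal sketch; the resampling scheme and quantile levels are illustration choices):

```python
import numpy as np

def consistency_bars(conf, n_boot=2000, q=(0.025, 0.975), seed=0):
    """Consistency interval for one bin's observed relative frequency,
    assuming the predictions in the bin are reliable: resample each
    outcome as Bernoulli(conf_i), then take quantiles of the bin mean."""
    rng = np.random.default_rng(seed)
    conf = np.asarray(conf, float)
    sims = rng.random((n_boot, len(conf))) < conf   # simulated outcomes
    freqs = sims.mean(axis=1)
    return np.quantile(freqs, q)

# lo, hi = consistency_bars(bin_confidences)
# an observed frequency outside [lo, hi] suggests miscalibration in that bin
```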