Corpus ID: 202770409

Calibration tests in multi-class classification: A unifying framework

@article{Widmann2019CalibrationTI,
  title={Calibration tests in multi-class classification: A unifying framework},
  author={David Widmann and Fredrik Lindsten and Dave Zachariah},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.11385}
}
In safety-critical applications a probabilistic model is usually required to be calibrated, i.e., to capture the uncertainty of its predictions accurately. In multi-class classification, calibration of the most confident predictions only is often not sufficient. We propose and study calibration measures for multi-class classification that generalize existing measures such as the expected calibration error, the maximum calibration error, and the maximum mean calibration error. We propose and… 
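As a point of reference for the measures named in the abstract, the sketch below shows how the expected and maximum calibration errors are commonly estimated by binning top-class confidences. It is a minimal NumPy illustration, not the estimators proposed in the paper, and all function and variable names are ours.

```python
import numpy as np

def binned_calibration_errors(confidences, correct, n_bins=10):
    """Binned estimates of the expected (ECE) and maximum (MCE) calibration error.

    confidences: (n,) predicted top-class probabilities.
    correct:     (n,) bool/0-1 array, True if the top-class prediction was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece, mce = len(confidences), 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.sum() / n * gap  # bin-weighted average gap
        mce = max(mce, gap)          # worst bin gap
    return ece, mce
```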
On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers
TLDR: This work introduces benchmarking on pseudo-real data, where the true calibration map can be estimated very precisely, and novel calibration and evaluation methods based on the new calibration map families PL and PL3.
Trustworthy Deep Learning via Proper Calibration Errors: A Unifying Approach for Quantifying the Reliability of Predictive Uncertainty
TLDR: The framework of proper calibration errors is introduced, which relates every calibration error to a proper score and provides a corresponding upper bound with optimal estimation properties, allowing the calibration improvement of any injective recalibration method to be estimated reliably and without bias.
Top-label calibration
TLDR: A histogram binning algorithm that reduces top-label multiclass calibration to the binary case is formalized, proven to have clean theoretical guarantees without distributional assumptions, and studied methodically in terms of practical performance.
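A minimal sketch of histogram binning applied to top-label confidences, in the spirit of the reduction to the binary case described above; it uses plain equal-width bins rather than the algorithm analyzed in the paper, and the function names are ours.

```python
import numpy as np

def fit_histogram_binning(confidences, correct, n_bins=10):
    """Fit equal-width histogram binning on held-out top-label confidences.

    Each bin's empirical accuracy becomes the recalibrated confidence for
    future predictions falling into that bin.
    """
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    bin_acc = np.array([
        correct[idx == b].mean() if np.any(idx == b) else (edges[b] + edges[b + 1]) / 2
        for b in range(n_bins)
    ])
    return edges, bin_acc

def apply_histogram_binning(confidences, edges, bin_acc):
    """Map new top-label confidences to their bin's empirical accuracy."""
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, len(bin_acc) - 1)
    return bin_acc[idx]
```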
Multi-Class Uncertainty Calibration via Mutual Information Maximization-based Binning
TLDR: The I-Max concept for binning is derived, which maximizes the mutual information between labels and binned (quantized) logits, and it outperforms state-of-the-art calibration methods on various evaluation metrics across different benchmark datasets and models, even when using only a small set of calibration data.
Hidden Heterogeneity: When to Choose Similarity-Based Calibration
TLDR: HH can serve as a useful diagnostic tool for identifying when local calibration methods are needed; improvements in calibration achieved by similarity-based calibration methods correlate with the amount of HH present and generally exceed those achieved by global methods.
Calibrate: Interactive Analysis of Probabilistic Model Output
TLDR: Calibrate constructs a reliability diagram that is resistant to drawbacks of traditional approaches, allows for interactive subgroup analysis and instance-level inspection, and is validated by the results of a think-aloud experiment with data scientists who routinely analyze model calibration.
Calibration of Neural Networks using Splines
TLDR: This work introduces a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test, in which the main idea is to compare the respective cumulative probability distributions.
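A binning-free KS-style calibration measure of the kind mentioned above can be sketched by comparing cumulative confidence with cumulative accuracy after sorting by confidence; this is only an illustrative reading of the idea, not the paper's spline-based method, and the names are ours.

```python
import numpy as np

def ks_calibration_error(confidences, correct):
    """Maximum gap between cumulative predicted confidence and cumulative
    observed accuracy, with samples sorted by confidence (no binning)."""
    order = np.argsort(confidences)
    conf = np.asarray(confidences, dtype=float)[order]
    acc = np.asarray(correct, dtype=float)[order]
    return np.max(np.abs(np.cumsum(conf - acc))) / len(conf)
```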
Local Calibration: Metrics and Recalibration
TLDR: This work proposes the local calibration error (LCE), which can be estimated sample-efficiently from data, together with a local recalibration method, and empirically finds that the LCE reveals miscalibration modes that are more fine-grained than the ECE can detect.
Calibrating Predictions to Decisions: A Novel Approach to Multi-Class Calibration
TLDR: This work introduces a new notion, decision calibration, that requires the predicted distribution and the true distribution to be "indistinguishable" to a set of downstream decision-makers, and designs recalibration algorithms whose sample complexity is polynomial in the number of actions and the number of classes.
What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability
TLDR: It is argued that more expressive metrics must be developed that accurately measure calibration error for the specific context in which a classifier will be deployed, and a number of different metrics that measure calibration error under different definitions of reliability are derived using a generalization of the Expected Calibration Error (ECE).
...

References

Showing 1-10 of 42 references
Evaluating model calibration in classification
TLDR: This work develops a general theoretical calibration evaluation framework grounded in probability theory and points out subtleties present in model calibration evaluation that lead to refined interpretations of existing evaluation techniques.
On Calibration of Modern Neural Networks
TLDR: It is discovered that modern neural networks, unlike those from a decade ago, are poorly calibrated, and that on most datasets temperature scaling, a single-parameter variant of Platt scaling, is surprisingly effective at calibrating predictions.
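Temperature scaling as summarized above amounts to fitting a single scalar on held-out logits; a minimal NumPy/SciPy sketch (assuming `logits` of shape (n, K) and integer class `labels`) could look like this.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single temperature T > 0 by minimizing the negative
    log-likelihood of softmax(logits / T) on a held-out validation set."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)

    def nll(log_t):
        z = logits / np.exp(log_t)            # parameterize T = exp(log_t) > 0
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    result = minimize_scalar(nll, bounds=(-5.0, 5.0), method="bounded")
    return np.exp(result.x)  # calibrated probabilities: softmax(logits / T)
```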
Obtaining Well Calibrated Probabilities Using Bayesian Binning
TLDR: A new non-parametric calibration method called Bayesian Binning into Quantiles (BBQ) is presented, which addresses key limitations of existing calibration methods and can be readily combined with many existing classification algorithms.
Verified Uncertainty Calibration
TLDR: The scaling-binning calibrator is introduced, which first fits a parametric function to reduce variance and then bins the function values to actually ensure calibration; a model's calibration error is also estimated more accurately using an estimator from the meteorological community.
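Reading the summary above literally, a scaling-binning style calibrator can be sketched in two steps: apply an already-fitted parametric scaling function, then replace each score by the mean scaled value of its equal-mass bin. Details such as bin placement differ from the paper, and all names are ours.

```python
import numpy as np

def scaling_binning(scaled_calib, scaled_eval, n_bins=10):
    """scaled_calib: outputs of a fitted parametric scaler (e.g. Platt or
    temperature scaling) on a held-out calibration split.
    scaled_eval:  outputs of the same scaler on the data to recalibrate.
    Each evaluation score is replaced by the mean scaled value of its
    equal-mass bin, keeping variance low while producing discrete outputs."""
    calib = np.sort(np.asarray(scaled_calib, dtype=float))
    inner_edges = np.quantile(calib, np.linspace(0, 1, n_bins + 1))[1:-1]
    calib_bins = np.digitize(calib, inner_edges)
    bin_means = np.array([calib[calib_bins == b].mean()
                          if np.any(calib_bins == b) else 0.5
                          for b in range(n_bins)])
    eval_bins = np.clip(np.digitize(scaled_eval, inner_edges), 0, n_bins - 1)
    return bin_means[eval_bins]
```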
Trainable Calibration Measures For Neural Networks From Kernel Mean Embeddings
TLDR: MMCE is presented, an RKHS-kernel-based measure of calibration that is efficiently trainable alongside the negative log-likelihood loss without careful hyperparameter tuning, and whose finite-sample estimates are consistent and enjoy fast convergence rates.
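The kernel calibration measure summarized above can be illustrated with a simple V-statistic estimate over top-label confidences and correctness indicators; the Laplacian kernel and its width below are assumptions made for the sketch, not values taken from the paper.

```python
import numpy as np

def kernel_calibration_error(confidences, correct, width=0.4):
    """V-statistic estimate of an MMCE-style kernel calibration error:
    sqrt( (1/m^2) * sum_{i,j} (c_i - r_i)(c_j - r_j) k(r_i, r_j) )
    with a Laplacian kernel k(r, r') = exp(-|r - r'| / width)."""
    r = np.asarray(confidences, dtype=float)
    c = np.asarray(correct, dtype=float)
    gap = c - r                                            # per-sample calibration gap
    k = np.exp(-np.abs(r[:, None] - r[None, :]) / width)   # Laplacian kernel matrix
    return np.sqrt(np.maximum(gap @ k @ gap, 0.0)) / len(r)
```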
A Kernel Two-Sample Test
TLDR: This work proposes a framework for analyzing and comparing distributions, which is used to construct statistical tests to determine whether two samples are drawn from different distributions, and presents two distribution-free tests based on large deviation bounds for the maximum mean discrepancy (MMD).
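For reference, the unbiased estimator of the squared MMD used in kernel two-sample tests looks roughly as follows; the RBF kernel and its bandwidth are assumptions made for the sketch.

```python
import numpy as np

def mmd2_unbiased(x, y, gamma=1.0):
    """Unbiased estimate of the squared maximum mean discrepancy between
    samples x (m, d) and y (n, d) with an RBF kernel exp(-gamma * ||a - b||^2)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def rbf(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)

    kxx, kyy, kxy = rbf(x, x), rbf(y, y), rbf(x, y)
    m, n = len(x), len(y)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))   # off-diagonal mean of k(x, x)
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1)) # off-diagonal mean of k(y, y)
            - 2.0 * kxy.mean())
```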
Reliability, Sufficiency, and the Decomposition of Proper Scores
TLDR: It is demonstrated that resolution and reliability are directly related to forecast attributes that are desirable on grounds independent of the notion of scores, which can be considered an epistemological justification for measuring forecast quality by proper scores.
Estimating reliability and resolution of probability forecasts through decomposition of the empirical score
Proper scoring rules provide a useful means to evaluate probabilistic forecasts. Independently of scoring rules, it has been argued that reliability and resolution are desirable forecast attributes.
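The classical (Murphy-style) decomposition alluded to in the two entries above can be written for binary forecasts, after binning, as Brier score ≈ reliability - resolution + uncertainty. The sketch below is a generic illustration of that decomposition, not the estimators studied in these papers.

```python
import numpy as np

def brier_decomposition(forecasts, outcomes, n_bins=10):
    """Binned reliability/resolution/uncertainty decomposition of the Brier
    score for binary forecasts (exact only if forecasts are constant per bin)."""
    p = np.asarray(forecasts, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    n, base_rate = len(p), y.mean()
    uncertainty = base_rate * (1.0 - base_rate)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    reliability = resolution = 0.0
    for b, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (p >= lo) & (p <= hi) if b == n_bins - 1 else (p >= lo) & (p < hi)
        if not mask.any():
            continue
        w = mask.sum() / n
        p_bar, y_bar = p[mask].mean(), y[mask].mean()
        reliability += w * (p_bar - y_bar) ** 2     # calibration term (lower is better)
        resolution += w * (y_bar - base_rate) ** 2  # spread of conditional event rates
    return reliability, resolution, uncertainty
```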
Rademacher and Gaussian Complexities: Risk Bounds and Structural Results
TLDR: This work investigates the use of certain data-dependent estimates of the complexity of a function class, called Rademacher and Gaussian complexities, and proves general risk bounds in terms of these complexities in a decision-theoretic setting.
Increasing the Reliability of Reliability Diagrams
TLDR: A resampling method for assigning consistency bars to the observed frequencies is introduced, allowing immediate visual evaluation of just how likely the observed relative frequencies are under the assumption that the predicted probabilities are reliable.
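A resampling scheme of the kind described above can be sketched by simulating outcomes under the hypothesis that the predicted probabilities are reliable and reading off per-bin quantiles; the bin layout, quantile levels, and names below are assumptions made for the sketch.

```python
import numpy as np

def consistency_bars(confidences, n_bins=10, n_resamples=1000, alpha=0.05, seed=0):
    """Per-bin (lower, upper) quantiles of simulated relative frequencies,
    generated under the assumption that the predictions are reliable,
    i.e. outcomes are Bernoulli draws with the predicted probabilities."""
    rng = np.random.default_rng(seed)
    p = np.asarray(confidences, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bars = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p > lo) & (p <= hi)
        if not mask.any():
            bars.append((np.nan, np.nan))
            continue
        sims = rng.random((n_resamples, mask.sum())) < p[mask]  # simulated outcomes
        freqs = sims.mean(axis=1)                               # simulated bin frequencies
        bars.append((np.quantile(freqs, alpha / 2),
                     np.quantile(freqs, 1 - alpha / 2)))
    return bars
```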
...