A Unifying Theory of Distance from Calibration

Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Preetum Nakkiran
We study the fundamental question of how to define and measure the distance from calibration for probabilistic predictors. While the notion of perfect calibration is well-understood, there is no consensus on how to quantify the distance from perfect calibration. Numerous calibration measures have been proposed in the literature, but it is unclear how they compare to each other, and many popular measures such as Expected Calibration Error (ECE) fail to satisfy basic properties like continuity. We… 
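The continuity failure mentioned in the abstract can be illustrated with a small sketch. The two-point dataset and the helper name `ece` below are assumptions chosen for illustration, not taken from the text above. ECE is computed here by grouping on the distinct predicted values, which is only meaningful for a predictor with finitely many outputs.

```python
from collections import defaultdict


def ece(preds, labels):
    """ECE for a predictor with finitely many distinct outputs:
    sum over distinct values v of P(f = v) * |E[y | f = v] - v|."""
    groups = defaultdict(list)
    for p, y in zip(preds, labels):
        groups[p].append(y)
    n = len(preds)
    return sum(
        len(ys) / n * abs(sum(ys) / len(ys) - v)
        for v, ys in groups.items()
    )


# Hypothetical two-point dataset: one negative and one positive example.
labels = [0, 1]

# Always predicting 1/2 is perfectly calibrated on this dataset.
constant = [0.5, 0.5]

# Moving each prediction by eps separates the two points into their own groups.
eps = 0.01
perturbed = [0.5 - eps, 0.5 + eps]

ece_constant = ece(constant, labels)
ece_perturbed = ece(perturbed, labels)
print(ece_constant, ece_perturbed)
```

The two predictors differ by only `eps` at every point, yet the ECE moves from 0 to roughly 1/2 - `eps`, which is the kind of discontinuity the abstract refers to.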

On the Within-Group Discrimination of Screening Classifiers

It is argued that screening policies that use calibrated classifiers may suffer from an understudied type of within-group discrimination -- they may discriminate against qualified members within demographic groups of interest -- and that this type of discrimination can be avoided if classifiers satisfy within-group monotonicity, a natural monotonicity property within each of the groups.

An Operational Perspective to Fairness Interventions: Where and How to Intervene

It is found that predictive parity is difficult to achieve without using group data, and that, despite requiring group data during model training (but not inference), distributionally robust methods provide a significant Pareto improvement.



T-Cal: An optimal test for the calibration of predictive models

This work treats the detection of mis-calibration of predictive models from a finite validation dataset as a hypothesis testing problem and proposes T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the ℓ2-Expected Calibration Error (ECE).

Mitigating bias in calibration error estimation

A simple alternative calibration error metric, ECE_sweep, is proposed, in which the number of bins is chosen to be as large as possible while preserving monotonicity in the calibration function; it produces a less biased estimator of calibration error.
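A minimal sketch of the bin-sweeping idea described above, under several assumptions that the summary does not state: equal-mass bins, an absolute-difference (L1) error, and stopping at the first bin count whose per-bin label frequencies are not non-decreasing. The function names are hypothetical and this is not the authors' implementation.

```python
def equal_mass_bins(pairs, num_bins):
    """Split sorted (prediction, label) pairs into num_bins contiguous chunks."""
    n = len(pairs)
    bins = []
    for b in range(num_bins):
        lo, hi = b * n // num_bins, (b + 1) * n // num_bins
        if hi > lo:
            bins.append(pairs[lo:hi])
    return bins


def binned_ece(bins, n):
    """Weighted mean absolute gap between mean prediction and label frequency."""
    total = 0.0
    for chunk in bins:
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_label = sum(y for _, y in chunk) / len(chunk)
        total += len(chunk) / n * abs(mean_pred - mean_label)
    return total


def ece_sweep(preds, labels):
    """Return (ECE, bin count) for the largest monotone bin count found."""
    pairs = sorted(zip(preds, labels))
    n = len(pairs)
    best = equal_mass_bins(pairs, 1)
    for num_bins in range(2, n + 1):
        bins = equal_mass_bins(pairs, num_bins)
        freqs = [sum(y for _, y in chunk) / len(chunk) for chunk in bins]
        # Assumption: stop as soon as the label frequencies decrease somewhere.
        if any(left > right for left, right in zip(freqs, freqs[1:])):
            break
        best = bins
    return binned_ece(best, n), len(best)


preds = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
sweep_value, sweep_bins = ece_sweep(preds, labels)
print(sweep_value, sweep_bins)
```

On this toy input the sweep stops at five bins, where the third bin's label frequency drops below the second's, so the estimate is computed with four bins.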

On Calibration of Modern Neural Networks

It is discovered that modern neural networks, unlike those from a decade ago, are poorly calibrated, and on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
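As a rough sketch of temperature scaling: logits are divided by a single scalar T before the softmax, and T is chosen to minimise negative log-likelihood on held-out data. The grid search and the function names below are assumptions made to keep the example dependency-free; the summary above does not say how T is optimised.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid (assumed: 0.1 to 10.0) with lowest NLL."""
    grid = grid or [0.1 * k for k in range(1, 101)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Hypothetical validation set: the model is about 98% confident in class 0
# on every example but is right only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

best_t = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], best_t)
print(best_t, calibrated)
```

Because the toy model is overconfident, the fitted temperature is greater than 1 and the rescaled confidence moves towards the observed accuracy of 0.75. Dividing by a positive T does not change which class has the largest probability.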

Calibration of Neural Networks using Splines

This work introduces a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the main idea is to compare the respective cumulative probability distributions.
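One way to read the cumulative-distribution comparison is sketched below: sort examples by confidence, accumulate confidence and correctness separately, and take the largest gap between the two running totals. The function name and the handling of tied confidences (comparing only at the end of each tie group) are assumptions of this sketch, not details given in the summary.

```python
def ks_calibration_error(confidences, correct):
    """Largest gap between cumulative confidence and cumulative accuracy,
    both normalised by the number of examples, after sorting by confidence."""
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    cum_conf = 0.0
    cum_acc = 0.0
    worst = 0.0
    for i, (conf, hit) in enumerate(pairs):
        cum_conf += conf / n
        cum_acc += hit / n
        # Assumption: examples with equal confidence are treated as one step,
        # so the result does not depend on how ties happen to be ordered.
        last_of_tie = i == n - 1 or pairs[i + 1][0] != conf
        if last_of_tie:
            worst = max(worst, abs(cum_conf - cum_acc))
    return worst


ks_spread = ks_calibration_error([0.6, 0.7, 0.8, 0.9], [0, 1, 1, 1])
ks_overconfident = ks_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
print(ks_spread, ks_overconfident)
```

No bin boundaries are chosen anywhere, which is the sense in which the measure is binning-free.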

Uncertainty Quantification and Deep Ensembles

It is demonstrated that, although standard ensembling techniques certainly help to boost accuracy, the calibration of deep ensembles relies on subtle trade-offs and, crucially, needs to be executed after the averaging process.

Mix-n-Match: Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning

This paper proposes an alternative data-efficient kernel density-based estimator for a reliable evaluation of the calibration performance and proves its asymptotic unbiasedness and consistency.

Soft Calibration Objectives for Neural Networks

Overall, experiments across losses and datasets demonstrate that using calibration-sensitive procedures yields better uncertainty estimates under dataset shift than the standard practice of using a cross-entropy loss and post-hoc recalibration methods.

Revisiting the Calibration of Modern Neural Networks

It is shown that the most recent models, notably those not using convolutions, are among the best calibrated, and that architecture is a major determinant of calibration properties.

Generalization Error Bounds for Bayesian Mixture Algorithms

This paper considers the class of Bayesian mixture algorithms, where an estimator is formed by constructing a data-dependent mixture over some hypothesis space, and demonstrates that mixture approaches are particularly robust, and allow for the construction of highly complex estimators, while avoiding undesirable overfitting effects.