Metrics of calibration for probabilistic predictions

Imanol Arrieta Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, Cherie Xu
Many predictions are probabilistic in nature; for example, a prediction could be for precipitation tomorrow, but with only a 30% chance. Given such probabilistic predictions together with the actual outcomes, “reliability diagrams” (also known as “calibration plots”) help detect and diagnose statistically significant discrepancies — so-called “miscalibration” — between the predictions and the outcomes. The canonical reliability diagrams are based on histogramming the observed and expected values… 

A Unifying Theory of Distance from Calibration

Fundamental lower and upper bounds on measuring distance to calibration are established, and theoretical justification is provided for preferring certain metrics (like Laplace kernel calibration) in practice.



Measuring Calibration in Deep Learning

A comprehensive empirical study of choices in calibration measures including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, number of bins, bins that are adaptive to the datapoint density, and the norm used to compare accuracies to confidences.
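Several of the choices listed above (number of bins, bins adaptive to datapoint density, and the norm) can be illustrated with a generic binned calibration error. The following is a minimal sketch under stated assumptions, not the estimator from the study: the function name `binned_calibration_error` and its parameters are hypothetical, it uses only the maximum-confidence prediction, and it ignores thresholding and class conditionality.

```python
import numpy as np


def binned_calibration_error(confidences, correct, n_bins=10, norm=1, adaptive=False):
    """Generic binned calibration error (illustrative; names are hypothetical).

    confidences: predicted probability of the chosen class, in [0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    norm:        exponent p of the l_p norm applied to the per-bin gaps.
    adaptive:    if True, use equal-mass (quantile) bins instead of equal-width bins.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if adaptive:
        # Bin edges at quantiles, so each bin holds roughly the same number of points.
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    else:
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0 .. n_bins - 1.
    bin_index = np.digitize(confidences, edges[1:-1])
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        mask = bin_index == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        total += (mask.sum() / n) * gap ** norm
    return total ** (1.0 / norm)
```

For example, `binned_calibration_error(conf, hits, n_bins=15, norm=2, adaptive=True)` changes the bin count, the norm, and the binning scheme at once, which makes it easy to see how sensitive the reported number is to each choice.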

T-Cal: An optimal test for the calibration of predictive models

This work considers detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem, and proposes T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the ℓ₂-Expected Calibration Error (ECE).

Some Remarks on the Reliability of Categorical Probability Forecasts

Studies on forecast evaluation often rely on estimating limiting observed frequencies conditioned on specific forecast probabilities (the reliability diagram or calibration function). Obviously,

Verified Uncertainty Calibration

The scaling-binning calibrator is introduced, which first fits a parametric function to reduce variance and then bins the function values to actually ensure calibration, and estimates a model's calibration error more accurately using an estimator from the meteorological community.

Mitigating bias in calibration error estimation

ECE_sweep, a simple alternative calibration error metric, is proposed; it chooses the number of bins to be as large as possible while preserving monotonicity in the calibration function, which produces a less biased estimator of calibration error.

Calibration of Neural Networks using Splines

This work introduces a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the main idea is to compare the respective cumulative probability distributions.
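The idea of comparing cumulative distributions instead of binning can be sketched as follows. This is a minimal illustration assuming binary "correct / incorrect" outcomes and a single confidence per prediction; the function name `ks_calibration_gap` is hypothetical, and the code is a generic KS-style statistic rather than the paper's exact procedure.

```python
import numpy as np


def ks_calibration_gap(confidences, correct):
    """Largest gap between cumulative predicted and cumulative observed frequency.

    Illustrative KS-style statistic; names and normalization are assumptions.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)
    # Sort by confidence; a stable sort keeps ties in their input order.
    order = np.argsort(confidences, kind="stable")
    cum_predicted = np.cumsum(confidences[order]) / n
    cum_observed = np.cumsum(correct[order]) / n
    # The maximum absolute difference plays the role of the KS statistic.
    return float(np.max(np.abs(cum_observed - cum_predicted)))
```

No bin count has to be chosen here: a well-calibrated predictor keeps the two cumulative curves close everywhere, while systematic over- or under-confidence makes them drift apart.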

A graphical method of cumulative differences between two subpopulations

Cumulative methods are developed for the common case in which no score of any member of the subpopulations being compared is exactly equal to the score of any other member of either subpopulation.

Cumulative deviation of a subpopulation from the full population

  • M. Tygert
  • Environmental Science
  • J. Big Data
  • 2021
The cumulative deviation of the subpopulation from the full population, as proposed in this paper, sidesteps the problematic coarse binning and encodes subpopulation deviation directly as the slopes of secant lines for the graphs.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Nonparametric comparison of regression functions