Performance Prediction Under Dataset Shift

Simona Maggio, Victor Bouvier, Leo Dreyfus-Schmidt
2022 26th International Conference on Pattern Recognition (ICPR)
ML models deployed in production often have to face unknown domain changes, fundamentally different from their training settings. Performance prediction models carry out the crucial task of measuring the impact of these changes on model performance. We study the generalization capabilities of various performance prediction models to new domains by learning on generated synthetic perturbations. Empirical validation on a benchmark of ten tabular datasets shows that models based upon state-of-the… 



Learning Prediction Intervals for Model Performance

This work uses transfer learning to train an uncertainty model that estimates the uncertainty of model performance predictions. The authors argue that this result makes prediction intervals, and performance prediction in general, significantly more practical for real-world use.

Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data

This work proposes a performance predictor for pretrained black box models, aimed at non-ML experts who work with such models. The predictor can be combined with the model and automatically warns end users of unexpected performance drops.

Predicting with Confidence on Unseen Distributions

This investigation determines that common distributional distances, such as Fréchet distance or Maximum Mean Discrepancy, fail to induce reliable estimates of performance under distribution shift. It finds that the proposed difference of confidences (DoC) approach yields successful estimates of a classifier’s performance over a variety of shifts and model architectures.
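A minimal sketch of one way a difference of confidences could be turned into an accuracy estimate. It assumes the simplest variant, where the drop in mean confidence from source to target is subtracted directly from the known source accuracy. The function name and this direct-subtraction rule are illustrative assumptions, not details given in the summary above.

```python
from statistics import mean


def doc_accuracy_estimate(source_acc, source_conf, target_conf):
    """Estimate target accuracy from a difference of confidences (DoC).

    source_acc  : accuracy measured on labeled source data
    source_conf : max-softmax confidences on source examples
    target_conf : max-softmax confidences on unlabeled target examples

    Assumption: the accuracy drop equals the drop in mean confidence.
    """
    doc = mean(source_conf) - mean(target_conf)
    # Clamp so the estimate stays a valid accuracy.
    return min(1.0, max(0.0, source_acc - doc))


estimate = doc_accuracy_estimate(
    source_acc=0.9,
    source_conf=[0.9, 0.8, 1.0],
    target_conf=[0.7, 0.6, 0.8],
)
print(round(estimate, 3))
```

In this toy input the mean confidence falls by 0.2, so the estimate is 0.2 below the source accuracy.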

Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

Average Thresholded Confidence (ATC) is proposed, a practical method that learns a threshold on the model’s confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
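The thresholding idea can be sketched as follows. This is a hedged illustration: it assumes the threshold is fit on labeled source data so that the share of source examples above it matches the source accuracy, and the function names are hypothetical.

```python
def fit_atc_threshold(source_conf, source_correct):
    """Pick a threshold t on labeled source data.

    Assumption: t is chosen so that the fraction of source confidences
    above t equals the source accuracy (ties are ignored in this sketch).
    """
    n_wrong = len(source_conf) - sum(source_correct)
    if n_wrong == 0:
        return float("-inf")  # every example counts as above the threshold
    ranked = sorted(source_conf)
    # The n_wrong lowest confidences fall at or below the threshold.
    return ranked[n_wrong - 1]


def predict_accuracy(target_conf, threshold):
    """Predicted accuracy: share of unlabeled examples with confidence > t."""
    return sum(c > threshold for c in target_conf) / len(target_conf)


source_conf = [0.95, 0.90, 0.85, 0.60, 0.55]
source_correct = [1, 1, 1, 0, 0]  # 1 = prediction was right
t = fit_atc_threshold(source_conf, source_correct)
print(t, predict_accuracy([0.99, 0.70, 0.50, 0.40], t))
```

By construction, applying the fitted threshold back to the source confidences recovers the source accuracy, which is a quick sanity check before using it on target data.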

To Annotate or Not? Predicting Performance Drop under Domain Shift

This paper investigates three families of methods (H-divergence, reverse classification accuracy, and confidence measures), shows how they can be used to predict the performance drop, and studies their robustness to adversarial domain shifts.

Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift

A large-scale benchmark of existing state-of-the-art methods on classification problems and the effect of dataset shift on accuracy and calibration is presented, finding that traditional post-hoc calibration does indeed fall short, as do several other previous methods.

Underspecification Presents Challenges for Credibility in Modern Machine Learning

This work shows the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain, and shows that this problem appears in a wide variety of practical ML pipelines.

Are Labels Always Necessary for Classifier Accuracy Evaluation?

Weijian Deng, Liang Zheng
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This work constructs a meta-dataset: a dataset composed of datasets generated from the original images via transformations such as rotation, background substitution, and foreground scaling. It reports a reasonable and promising prediction of the model accuracy.

Measuring Robustness to Natural Distribution Shifts in Image Classification

This study finds that there is often little to no transfer of robustness from current synthetic distribution shifts to natural ones, and the results indicate that robustness to distribution shifts arising in real data is currently an open research problem.

Detecting and Correcting for Label Shift with Black Box Predictors

Black Box Shift Estimation (BBSE) is proposed to estimate the test label distribution p(y). BBSE is proved to work even when predictors are biased, inaccurate, or uncalibrated, so long as their confusion matrices are invertible.
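A minimal NumPy sketch of the confusion-matrix idea behind BBSE, under label shift. It solves C w = mu, where C is the joint confusion matrix of predicted and true labels on source data, mu is the distribution of predictions on target data, and w holds the per-class importance weights. The function name, the clipping step, and the toy data are illustrative assumptions.

```python
import numpy as np


def bbse_label_distribution(y_true_src, y_pred_src, y_pred_tgt, n_classes):
    """Estimate the target label distribution from black box predictions."""
    y_true_src = np.asarray(y_true_src)
    y_pred_src = np.asarray(y_pred_src)
    y_pred_tgt = np.asarray(y_pred_tgt)

    # Joint confusion matrix on source: C[i, j] = P(pred = i, true = j).
    C = np.zeros((n_classes, n_classes))
    np.add.at(C, (y_pred_src, y_true_src), 1.0)
    C /= len(y_true_src)

    # Distribution of predicted labels on the unlabeled target set.
    mu = np.bincount(y_pred_tgt, minlength=n_classes) / len(y_pred_tgt)

    # Importance weights w[j] = q(y = j) / p(y = j); requires invertible C.
    w = np.linalg.solve(C, mu)

    p_src = np.bincount(y_true_src, minlength=n_classes) / len(y_true_src)
    q = np.clip(w, 0.0, None) * p_src  # clip negative weights (assumption)
    return q / q.sum()


# Toy data: a classifier that is 80% accurate on each class.
y_true_src = [0] * 5 + [1] * 5
y_pred_src = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_tgt = [0] * 32 + [1] * 68
print(bbse_label_distribution(y_true_src, y_pred_src, y_pred_tgt, n_classes=2))
```

The toy target predictions are consistent with a true target label distribution of 20% class 0 and 80% class 1, which the solve step recovers even though the classifier itself is imperfect.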