Corpus ID: 245877810

Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

S. Garg, Sivaraman Balakrishnan, Zachary Chase Lipton, Behnam Neyshabur, Hanie Sedghi
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model’s confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds the threshold.
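The ATC recipe above can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the abstract, not the paper's exact implementation; the function name and the quantile-based threshold selection are assumptions.

```python
import numpy as np

def atc_predict(source_conf, source_correct, target_conf):
    """Average Thresholded Confidence (ATC), sketched from the abstract:
    pick a threshold t so that the fraction of source examples with
    confidence at or above t matches the source accuracy, then predict
    target accuracy as the fraction of target confidences at or above t."""
    source_acc = source_correct.mean()
    # The (1 - accuracy)-quantile leaves roughly a source_acc fraction
    # of source confidences at or above the threshold.
    t = np.quantile(source_conf, 1.0 - source_acc)
    return float((target_conf >= t).mean())
```

In practice the per-example confidence could be, for example, the maximum softmax probability or the negative entropy of the model's predictive distribution.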

Predicting Out-of-Distribution Error with Confidence Optimal Transport

The method, Confidence Optimal Transport (COT), provides robust estimates of a model's performance on a target domain and achieves state-of-the-art results on three benchmark datasets and outperforms existing methods by a large margin.

A Learning Based Hypothesis Test for Harmful Covariate Shift

This work defines harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model and uses the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data to detect HCS.

Estimating and Explaining Model Performance When Both Covariates and Labels Shift

A new distribution shift model, Sparse Joint Shift (SJS), is proposed, which considers the joint shift of both labels and a few features and unifies and generalizes several existing shift models including label shift and sparse covariate shift.

RankFeat: Rank-1 Feature Removal for Out-of-distribution Detection

RankFeat is proposed, a simple yet effective post hoc approach for OOD detection by removing the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature.
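The rank-1 removal described above is a one-line operation given an SVD. A hedged NumPy sketch on a 2-D feature matrix follows (the paper applies the operation to high-level feature maps inside the network; this only illustrates the linear-algebra step):

```python
import numpy as np

def remove_rank1(feat):
    """Subtract the rank-1 matrix formed by the largest singular value
    and its associated singular vectors, as RankFeat's abstract describes."""
    U, S, Vt = np.linalg.svd(feat, full_matrices=False)
    return feat - S[0] * np.outer(U[:, 0], Vt[0, :])
```

Applied to a feature matrix that is itself rank 1, the result is (numerically) the zero matrix.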

Performance Prediction Under Dataset Shift

Empirical validation on a benchmark of ten tabular datasets shows that models based upon state-of-the-art shift detection metrics are not expressive enough to generalize to unseen domains, while Error Predictors bring a consistent improvement in performance prediction under shift.

Understanding new tasks through the lens of training data via exponential tilting

This work formulates a distribution shift model based on the exponential tilt assumption and learns importance weights for the training data by minimizing the KL divergence between the labeled train and unlabeled target datasets; the weights can be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection.

Predicting Out-of-Distribution Error with the Projection Norm

This work proposes a metric, Projection Norm, to predict a model’s performance on out-of-distribution (OOD) data without access to ground-truth labels, and finds that it is the only approach that achieves non-trivial detection performance on adversarial examples.

Explanation Shift: Investigating Interactions between Models and Shifting Data Distributions

It is found that the modeling of explanation shifts can be a better indicator for detecting out-of-distribution model behaviour than state-of-the-art techniques.

A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges

This survey aims to provide a cross-domain and comprehensive review of numerous eminent works in the respective areas, identifies their commonalities, and discusses and sheds light on future lines of research, intending to bring these fields closer together.

Domain Adaptation under Open Set Label Shift

We introduce the problem of domain adaptation under Open Set Label Shift (OSLS), where the label distribution can change arbitrarily and a new class may arrive during deployment, but the class-conditional distributions p(x|y) are domain-invariant.

Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles

A principled and practically effective framework is proposed that simultaneously addresses unsupervised accuracy estimation and error detection: it iteratively learns an ensemble of models to identify misclassified data points and performs self-training to improve the ensemble with the identified points.

Predicting with Confidence on Unseen Distributions

This investigation determines that common distributional distances, such as Fréchet distance or Maximum Mean Discrepancy, fail to induce reliable estimates of performance under distribution shift, and finds that the proposed difference of confidences (DoC) approach yields successful estimates of a classifier’s performance over a variety of shifts and model architectures.
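In its simplest form, the DoC estimate amounts to shifting the known source accuracy by the drop in mean confidence between source and target. A minimal sketch, with illustrative names (the paper evaluates several variants):

```python
import numpy as np

def doc_predict(source_conf, source_acc, target_conf):
    # Difference of Confidences (DoC): use the drop in mean confidence
    # from source to target as a proxy for the drop in accuracy.
    doc = float(source_conf.mean() - target_conf.mean())
    return source_acc - doc
```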

Mandoline: Model Evaluation under Distribution Shift

Mandoline describes a density ratio estimation framework over user-defined slices; empirical validation on NLP and vision tasks shows that it can estimate performance on the target distribution up to 3× more accurately than standard baselines.

RATT: Leveraging Unlabeled Data to Guarantee Generalization

This work enables practitioners to certify generalization even when (labeled) holdout data is unavailable and provides insights into the relationship between random label noise and generalization.

Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

This paper explores the problem of building ML systems that fail loudly, investigating methods for detecting dataset shift, identifying exemplars that most typify the shift, and quantifying shift malignancy, and demonstrates that domain-discriminating approaches tend to be helpful for characterizing shifts qualitatively and determining if they are harmful.

Estimating Accuracy from Unlabeled Data: A Probabilistic Logic Approach

An efficient method to estimate the accuracy of classifiers using only unlabeled data is proposed, based on the intuition that when classifiers agree, they are more likely to be correct, and when the classifiers make a prediction that violates the constraints, at least one classifier must be making an error.
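The agreement intuition above has a classical closed-form illustration for three binary classifiers with conditionally independent errors: pairwise agreement rates alone determine each classifier's accuracy, since 2·P(agree) − 1 = (2a_i − 1)(2a_j − 1). The sketch below shows that classical decomposition, not the paper's probabilistic-logic method:

```python
import numpy as np

def triple_accuracy(preds):
    """Estimate each of three binary classifiers' accuracies from pairwise
    agreement alone, assuming conditionally independent errors. With
    b_i = 2*a_i - 1, each pair satisfies 2*agree_ij - 1 = b_i * b_j,
    so each b_i is recoverable from the three pairwise products."""
    c = {}
    for i in range(3):
        for j in range(i + 1, 3):
            agree = float((preds[i] == preds[j]).mean())
            c[(i, j)] = 2 * agree - 1  # = b_i * b_j under the model
    b = [
        np.sqrt(c[(0, 1)] * c[(0, 2)] / c[(1, 2)]),
        np.sqrt(c[(0, 1)] * c[(1, 2)] / c[(0, 2)]),
        np.sqrt(c[(0, 2)] * c[(1, 2)] / c[(0, 1)]),
    ]
    return [(bi + 1) / 2 for bi in b]
```

Note this requires agreement rates above chance for every pair and breaks down when errors are correlated, which is part of what motivates the richer constraint-based formulation in the paper.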

Estimating Generalization under Distribution Shifts via Domain-Invariant Representations

This work uses a set of domain-invariant predictors as a proxy for the unknown, true target labels, and enables self-tuning of domain adaptation models, and accurately estimates the target error of given models under distribution shift.

Understanding the Failure Modes of Out-of-Distribution Generalization

This work identifies the fundamental factors that give rise to why models fail this way in easy-to-learn tasks where one would expect these models to succeed, and uncovers two complementary failure modes.

Regularized Learning for Domain Adaptation under Label Shifts

We propose Regularized Learning under Label Shifts (RLLS), a principled and practical domain-adaptation algorithm to correct for shifts in the label distribution between a source and a target domain.
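RLLS builds on the standard confusion-matrix estimator of label-shift importance weights. A sketch of that unregularized baseline follows (the paper's contribution is the regularization, which is not shown here; names are illustrative):

```python
import numpy as np

def label_shift_weights(source_labels, source_preds, target_preds, n_classes):
    """Estimate label-shift importance weights w[y] = p_target(y) / p_source(y)
    by solving C w = mu, where C[i, j] = p_source(pred = i, label = j) is the
    joint confusion matrix and mu[i] = p_target(pred = i)."""
    n = len(source_labels)
    C = np.zeros((n_classes, n_classes))
    for pred, label in zip(source_preds, source_labels):
        C[pred, label] += 1.0 / n
    mu = np.bincount(target_preds, minlength=n_classes) / len(target_preds)
    # Plain least-squares solve; RLLS adds regularization here for
    # robustness when C is poorly conditioned or samples are few.
    w, *_ = np.linalg.lstsq(C, mu, rcond=None)
    return np.clip(w, 0.0, None)
```

The recovered weights can then reweight the source loss so that training matches the target label distribution.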

Predicting Unreliable Predictions by Shattering a Neural Network

This work proposes not only a theoretical framework to reason about subfunction error bounds but also a pragmatic way of approximately evaluating it, which it applies to predicting which samples the network will not successfully generalize to.