• Corpus ID: 235390418

What Does Rotation Prediction Tell Us about Classifier Accuracy under Varying Testing Environments?

Weijian Deng, Stephen Gould, Liang Zheng
International Conference on Machine Learning
Understanding classifier decision under novel environments is central to the community, and a common practice is evaluating it on labeled test sets. However, in real-world testing, image annotations are difficult and expensive to obtain, especially when the test environment is changing. A natural question then arises: given a trained classifier, can we evaluate its accuracy on varying unlabeled test sets? In this work, we train semantic classification and rotation prediction in a multi-task way… 
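The abstract is truncated here, so what follows is only a minimal sketch of the general rotation-prediction pretext task it mentions, not the authors' implementation. It shows why rotation prediction needs no human annotation: the rotation label is created by the transformation itself, so rotation accuracy can be measured on any unlabeled test set. The function names (`rotate90`, `rotation_task`, `rotation_accuracy`) and the list-of-lists image format are assumptions made for illustration.

```python
def rotate90(image, k):
    """Rotate a 2D list-of-lists image counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def rotation_task(images):
    """Build (rotated_image, label) pairs; label k in 0..3 means k * 90 degrees."""
    return [(rotate90(img, k), k) for img in images for k in range(4)]


def rotation_accuracy(predict_rotation, images):
    """Fraction of rotated copies whose rotation label is predicted correctly.

    `predict_rotation` is a hypothetical callable standing in for the
    rotation head of a trained network; no semantic labels are needed.
    """
    pairs = rotation_task(images)
    correct = sum(1 for rotated, k in pairs if predict_rotation(rotated) == k)
    return correct / len(pairs)


# Toy usage: a predictor that always answers "0 degrees" is right on
# exactly one of the four rotated copies of each image.
unlabeled = [[[1, 2], [3, 4]]]
print(rotation_accuracy(lambda img: 0, unlabeled))
```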

Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles

A principled and practically effective framework that simultaneously addresses unsupervised accuracy estimation and error detection and iteratively learns an ensemble of models to identify mis-classified data points and performs self-training to improve the ensemble with the identified points.

Predicting with Confidence on Unseen Distributions

This investigation determines that common distributional distances, such as Fréchet distance or Maximum Mean Discrepancy, fail to induce reliable estimates of performance under distribution shift, and finds that the proposed difference of confidences (DoC) approach yields successful estimates of a classifier’s performance over a variety of shifts and model architectures.
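As a rough illustration of the difference-of-confidences idea, the sketch below subtracts the drop in mean top-class confidence from a known source accuracy. Using DoC as a direct correction is an assumption made here for simplicity; the summarized work may instead feed it to a learned regressor. The names `mean_confidence` and `doc_estimate` are hypothetical.

```python
def mean_confidence(prob_rows):
    """Average of the top predicted probability over a set of predictions."""
    return sum(max(row) for row in prob_rows) / len(prob_rows)


def doc_estimate(source_probs, source_accuracy, target_probs):
    """Estimate target accuracy from the drop in mean confidence.

    Assumption: accuracy falls by the same amount as mean confidence.
    The result is clipped to the valid range [0, 1].
    """
    doc = mean_confidence(source_probs) - mean_confidence(target_probs)
    return min(1.0, max(0.0, source_accuracy - doc))


# Toy usage with made-up probability rows (one row per example).
source = [[0.9, 0.1], [0.8, 0.2]]
target = [[0.6, 0.4], [0.7, 0.3]]
print(doc_estimate(source, 0.9, target))
```

Only the source accuracy requires labels; the target side uses model outputs alone.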

Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

Average Thresholded Confidence (ATC) is proposed, a practical method that learns a threshold on the model’s confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
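The thresholding step described above can be sketched as follows. The summary does not say how the threshold is learned, so this sketch assumes it is chosen on labeled source data such that the fraction of source examples above the threshold matches the source accuracy. The function names and the candidate-search strategy are illustrative assumptions, not the published procedure.

```python
def fit_threshold(source_conf, source_correct):
    """Choose a confidence threshold on labeled source data.

    Assumption: the best threshold is the one whose pass rate (fraction of
    source examples with confidence above it) is closest to source accuracy.
    """
    accuracy = sum(source_correct) / len(source_correct)
    best_t, best_gap = 0.0, float("inf")
    # Candidate thresholds: 0.0 plus every observed confidence value.
    for t in [0.0] + sorted(set(source_conf)):
        pass_rate = sum(c > t for c in source_conf) / len(source_conf)
        gap = abs(pass_rate - accuracy)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t


def predict_accuracy(target_conf, threshold):
    """Predicted accuracy: fraction of unlabeled examples above the threshold."""
    return sum(c > threshold for c in target_conf) / len(target_conf)


# Toy usage with made-up confidences; 1 marks a correct source prediction.
t = fit_threshold([0.95, 0.9, 0.8, 0.6, 0.55], [1, 1, 1, 0, 0])
print(t, predict_accuracy([0.99, 0.7, 0.5, 0.4], t))
```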

Ranking Models in Unlabeled New Environments

The problem of ranking models in unlabeled new environments is introduced, a proxy dataset that is fully labeled and well reflects the true model rankings in a given target environment is proposed, and a carefully constructed proxy set effectively captures relative performance ranking in new environments.

Predicting is not Understanding: Recognizing and Addressing Underspecification in Machine Learning

This work formalizes the concept of underspecification and proposes a method to identify and partially address it, and trains multiple models with an independence constraint that forces them to implement different functions, resulting in a global model with superior OOD performance.

Performance Prediction Under Dataset Shift

Empirical validation on a benchmark of ten tabular datasets shows that models based upon state-of-the-art shift detection metrics are not expressive enough to generalize to unseen domains, while Error Predictors bring a consistent improvement in performance prediction under shift.

Label-Free Model Evaluation with Semi-Structured Dataset Representations

This work proposes a new semi-structured dataset representation that is manageable for regression learning while containing rich information for AutoEval, and integrates distribution shapes, clusters, and representative samples for a semi-structured dataset representation.

Improving Self-supervised Learning for Out-of-distribution Task via Auxiliary Classifier

An end-to-end deep multi-task network is proposed that exhibits a clear improvement in semantic classification accuracy over two other baseline methods and has been validated on three unseen OOD datasets.

Estimating and Explaining Model Performance When Both Covariates and Labels Shift

A new distribution shift model, Sparse Joint Shift (SJS), is proposed, which considers the joint shift of both labels and a few features and unifies and generalizes several existing shift models including label shift and sparse covariate shift.

Predicting Out-of-Distribution Error with the Projection Norm

This work proposes a metric, Projection Norm, to predict a model’s performance on out-of-distribution (OOD) data without access to ground truth labels and finds that it is the only approach that achieves non-trivial detection performance on adversarial examples.



Are Labels Always Necessary for Classifier Accuracy Evaluation?

  • Weijian Deng, Liang Zheng
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
This work constructs a meta-dataset: a dataset comprised of datasets generated from the original images via various transformations such as rotation, background substitution, foreground scaling, etc., and reports a reasonable and promising prediction of the model accuracy.

Estimating Accuracy from Unlabeled Data: A Probabilistic Logic Approach

An efficient method to estimate the accuracy of classifiers using only unlabeled data is proposed, based on the intuition that when classifiers agree, they are more likely to be correct, and when the classifiers make a prediction that violates the constraints, at least one classifier must be making an error.

Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

It is shown that per-instance disagreement is an unbiased estimate of the variance of error for that instance, and that disagreement provides a lower bound on the prediction (generalization) error and a tight upper bound on the "variance of prediction error", where variance is measured across training sets.
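To make the notion of model disagreement on unlabeled data concrete, here is a minimal sketch that computes the disagreement rate between two classifiers' predictions. It does not reproduce the bounds stated in the summary; the only property relied on is the elementary one that whenever two classifiers disagree on an example, at least one of them is wrong. The name `disagreement_rate` is hypothetical.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of unlabeled examples on which two classifiers disagree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy usage: predictions from two models (for example, trained on
# different training sets) on the same four unlabeled examples.
model_a = ["cat", "dog", "dog", "cat"]
model_b = ["cat", "cat", "dog", "dog"]
print(disagreement_rate(model_a, model_b))
```

Because each disagreement implies at least one error, the two models' error rates must sum to at least the disagreement rate, which is why disagreement carries information about error even without labels.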

Estimating Accuracy from Unlabeled Data: A Bayesian Approach

A simple graphical model is presented that performs well in practice, together with two nonparametric extensions that improve its performance and outperform existing state-of-the-art solutions in both estimating accuracies and combining multiple classifier outputs.

Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift

A large-scale benchmark of existing state-of-the-art methods on classification problems and the effect of dataset shift on accuracy and calibration is presented, finding that traditional post-hoc calibration does indeed fall short, as do several other previous methods.

Test-Time Training with Self-Supervision for Generalization under Distribution Shifts

This work turns a single unlabeled test sample into a self-supervised learning problem, on which the model parameters are updated before making a prediction, which leads to improvements on diverse image classification benchmarks aimed at evaluating robustness to distribution shifts.

Do CIFAR-10 Classifiers Generalize to CIFAR-10?

This work measures the accuracy of CIFAR-10 classifiers by creating a new test set of truly unseen images and finds a large drop in accuracy for a broad range of deep learning models.

Likelihood Ratios for Out-of-Distribution Detection

This work investigates deep generative model based approaches for OOD detection and observes that the likelihood score is heavily affected by population level background statistics, and proposes a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics.

Computing the Testing Error Without a Testing Set

This work derives a set of persistent topology measures that identify when a DNN is learning to generalize to unseen samples, and provides extensive experimental validation on multiple networks and datasets to demonstrate the feasibility of the proposed approach.

Self-Supervised Learning Across Domains

This paper proposes a multi-task method, combining supervised and self-supervised knowledge, that provides competitive results with respect to more complex domain generalization and adaptation solutions, and proves its potential in the novel and challenging predictive and partial domain adaptation scenarios.