• Corpus ID: 218684960

Identifying Statistical Bias in Dataset Replication

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, Aleksander Madry
Dataset replication is a useful tool for assessing whether improvements in test accuracy on a specific benchmark correspond to improvements in models' ability to generalize reliably. In this work, we present unintuitive yet significant ways in which standard approaches to dataset replication introduce statistical bias, skewing the resulting observations. We study ImageNet-v2, a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy, even after… 


  • Computer Science
  • 2020
An end-to-end debugging framework called Defuse is proposed to use regions on the data manifold that are incorrectly classified by a model for fixing faulty classifier predictions, and finds that it identifies and resolves concerning predictions while maintaining model generalization.

Improving robustness against common corruptions by covariate shift adaptation

It is argued that results with adapted statistics should be included whenever reporting scores in corruption benchmarks and other out-of-distribution generalization settings, and 32 samples are sufficient to improve the current state of the art for a ResNet-50 architecture.

A Principled Evaluation Protocol for Comparative Investigation of the Effectiveness of DNN Classification Models on Similar-but-non-identical Datasets

The experimental results indicate that the observed accuracy degradation between established benchmark datasets and their replications is consistently lower than the accuracy degradation reported in published works, with these published works relying on conventional evaluation approaches that do not utilize uncertainty-related information.



A Siren Song of Open Source Reproducibility

It is argued that venues must take more action to advance reproducible machine learning research today and there is a lack of evidence for effective actions taken by conferences to encourage and reward reproducibility.

On Modality Bias Recognition and Reduction

A plug-and-play loss function method, whereby the feature space for each label is adaptively learned according to the training set statistics, which yields remarkable performance improvements compared with the baselines, demonstrating its superiority on reducing the modality bias problem.

Generative multitask learning mitigates target-causing confounding

This work proposes a simple and scalable approach to causal representation learning for multitask learning that takes into account the dependencies between the targets in order to alleviate target-causing confounding, and improves robustness to target shift.

FOCUS: Familiar Objects in Common and Uncommon Settings

FOCUS is introduced, a dataset for stress-testing the generalization power of deep image classifiers; the dataset will aid researchers in understanding the inability of deep models to generalize well to uncommon settings and drive future work on improving their distributional robustness.

Defuse: Training More Robust Models through Creation and Correction of Novel Model Errors

This work proposes Defuse: a technique that trains a generative model on a classifier’s training dataset and then uses the latent space to generate new samples which are no longer correctly predicted by the classifier, and reveals novel sources of model errors.

Unsolved Problems in ML Safety

This work provides a new roadmap for ML Safety and presents four problems ready for research, namely withstanding hazards, identifying hazards, steering ML systems, and reducing deployment hazards.



Do CIFAR-10 Classifiers Generalize to CIFAR-10?

This work measures the accuracy of CIFAR-10 classifiers by creating a new test set of truly unseen images and finds a large drop in accuracy for a broad range of deep learning models.

Do ImageNet Classifiers Generalize to ImageNet?

The results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.

On the Dangers of Cross-Validation. An Experimental Evaluation

It is shown empirically that, when such a large number of models is evaluated, the risk of overfitting increases and the performance estimated by cross-validation is no longer an effective estimate of generalization; hence, this paper provides an empirical reminder of the dangers of cross-validation.

ImageNet: A large-scale hierarchical image database

A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

ImageNet Large Scale Visual Recognition Challenge

The creation of this benchmark dataset and the advances in object recognition that have been possible as a result are described, and state-of-the-art computer vision accuracy is compared with human accuracy.

Optimizing JPEG Quantization for Classification Networks

This work asks whether JPEG Q-tables exist that are "better" for specific vision networks and can offer better quality-size trade-offs than ones designed for human perception or minimal distortion.

A Meta-Analysis of Overfitting in Machine Learning

This study conducts the first large meta-analysis of overfitting due to test set reuse in the machine learning community based on over one hundred machine learning competitions hosted on the Kaggle platform over the course of several years and shows little evidence of substantial overfitting.

An empirical study of pretrained representations for few-shot classification

This paper systematically investigates which models provide the best representations for a few-shot image classification task when pretrained on the ImageNet dataset, and tests their representations when used as the starting point for different few-shot classification algorithms.

Testing Robustness Against Unforeseen Adversaries

This work introduces a total of four novel adversarial attacks to create ImageNet-UA's diverse attack suite, and demonstrates that, in comparison to ImageNet-UA, prevailing L_inf robustness assessments give a narrow account of model robustness.