Identifying the Context Shift between Test Benchmarks and Production Data

  • Matt Groh
  • Published 3 July 2022
  • Computer Science
  • ArXiv
Across a wide variety of domains, there exists a performance gap between machine learning models’ accuracy on dataset benchmarks and real-world production data. Despite the careful design of static dataset benchmarks to represent the real world, models often err when the data is out-of-distribution relative to the data the models have been trained on. We can directly measure and adjust for some aspects of distribution shift, but we cannot address sample selection bias, adversarial…
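The abstract's claim that some aspects of distribution shift can be "directly measured" can be illustrated with a minimal drift check. This sketch is not from the paper; the standardized-mean-difference statistic and the 0.5 flagging threshold are illustrative assumptions.

```python
import numpy as np

def drift_score(train, prod):
    """Per-feature standardized mean difference between training and
    production samples; a crude proxy for covariate shift."""
    mu_t, mu_p = train.mean(axis=0), prod.mean(axis=0)
    pooled_sd = np.sqrt((train.var(axis=0) + prod.var(axis=0)) / 2) + 1e-12
    return np.abs(mu_t - mu_p) / pooled_sd

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 3))
prod = train + np.array([0.0, 0.0, 2.0])  # shift only the third feature

scores = drift_score(train, prod)
shifted = scores > 0.5  # flags only the deliberately shifted feature
```

Checks like this catch shifts in marginal feature statistics, which is exactly the measurable slice of the problem; sample selection bias and adversarial inputs, as the abstract notes, evade such tests.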

Towards Transparency in Dermatology Image Datasets with Skin Tone Annotations by Experts, Crowds, and an Algorithm

It is demonstrated that algorithms based on ITA-FST are not reliable for annotating large-scale image datasets, but human-centered, crowd-based protocols can reliably add skin type transparency to dermatology datasets.

BREEDS: Benchmarks for Subpopulation Shift

We develop a methodology for assessing the robustness of models to subpopulation shift---specifically, their ability to generalize to novel data subpopulations that were not observed during training.

WILDS: A Benchmark of in-the-Wild Distribution Shifts

WILDS is presented, a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications; the authors hope it will encourage the development of general-purpose methods that are anchored to real-world distribution shifts and that work well across different applications and problem settings.

Dataset Shift in Machine Learning

This volume offers an overview of current efforts to deal with dataset and covariate shift, and places dataset shift in relationship to transfer learning, transduction, local learning, active learning, and semi-supervised learning.
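The covariate-shift setting treated in this volume can be sketched with a toy importance-weighting example. This is illustrative only: here the training and test densities are known Gaussians so the density ratio is exact, whereas in practice it must be estimated (e.g., with a domain classifier).

```python
import numpy as np

# Covariate shift: p(x) differs between train and test while p(y|x) is fixed.
# A standard correction reweights training examples by the density ratio
# w(x) = p_test(x) / p_train(x).

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
x_train = rng.normal(0.0, 1.0, 5000)  # training covariates ~ N(0, 1)

# Test covariates ~ N(1, 1): the ratio of known densities gives exact weights.
w = gaussian_pdf(x_train, 1.0, 1.0) / gaussian_pdf(x_train, 0.0, 1.0)

naive_mean = x_train.mean()                    # estimates E[x] under train, ~0
corrected_mean = np.average(x_train, weights=w)  # estimates E[x] under test, ~1
```

The reweighted estimate recovers the test-distribution statistic from training samples alone, which is the core trick behind the covariate-shift corrections the volume surveys.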

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

To understand ML practitioners’ data documentation perceptions, needs, challenges, and desiderata, the authors derive seven design requirements for future data documentation frameworks, such as more actionable guidance on how the characteristics of a dataset might result in harms and how those harms might be mitigated.

AI and the Everything in the Whole Wide World Benchmark

The authors discuss why these benchmarks consistently fall short of capturing meaningful abstractions of their declared motivations, present the distorted lens of a specific worldview as the data to be optimized for, and disguise key limitations in ways that misrepresent the nature of actual “state of the art” (SOTA) performance of AI systems.

Dynabench: Rethinking Benchmarking in NLP

It is argued that Dynabench addresses a critical need in the community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.

From ImageNet to Image Classification: Contextualizing Progress on Benchmarks

This work uses human studies to investigate the consequences of employing a noisy data collection pipeline and study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset---including the introduction of biases that state-of-the-art models exploit.

The Clinician and Dataset Shift in Artificial Intelligence.

This letter outlines how to identify, and potentially mitigate, common sources of “dataset shift” in machine-learning systems.

Unbiased look at dataset bias

A comparison study of a set of popular datasets is presented, evaluating them on a number of criteria including relative data bias, cross-dataset generalization, the effects of the closed-world assumption, and sample value.