Model Cards for Model Reporting

@article{Mitchell2019ModelCF,
  title={Model Cards for Model Reporting},
  author={Margaret Mitchell and Simone Wu and Andrew Zaldivar and Parker Barnes and Lucy Vasserman and Ben Hutchinson and Elena Spitzer and Inioluwa Deborah Raji and Timnit Gebru},
  journal={Proceedings of the Conference on Fairness, Accountability, and Transparency},
  year={2019}
}
Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such… 
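As a concrete illustration of the framework, a model card can be thought of as a structured record whose fields follow the sections the paper proposes: model details, intended use, factors, metrics, evaluation data, training data, quantitative analyses, ethical considerations, and caveats and recommendations. The sketch below is a minimal, hypothetical Python encoding of those sections, not an official schema:

```python
# Minimal sketch of a model card as a structured record, following the
# section headings proposed in the paper. Field names are illustrative,
# not an official schema.
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    model_details: str          # developer, version, model type, training algorithm
    intended_use: str           # primary intended uses and users, out-of-scope uses
    factors: list[str] = field(default_factory=list)        # e.g. demographic groups, environments
    metrics: dict[str, float] = field(default_factory=dict)  # reported performance measures
    evaluation_data: str = ""   # datasets, motivation, preprocessing
    training_data: str = ""     # same details as evaluation data where possible
    # Disaggregated results, e.g. {"accuracy": {"group_a": 0.93, "group_b": 0.86}}
    quantitative_analyses: dict[str, dict[str, float]] = field(default_factory=dict)
    ethical_considerations: str = ""
    caveats_and_recommendations: str = ""

card = ModelCard(
    model_details="Toy smile classifier, v0.1",
    intended_use="Illustration only; not for deployment",
    factors=["age group", "skin tone"],
    metrics={"accuracy": 0.90},
    quantitative_analyses={"accuracy": {"group_a": 0.93, "group_b": 0.86}},
)
print(card.quantitative_analyses)
```

The disaggregated `quantitative_analyses` field captures the paper's central recommendation: performance should be reported across the relevant factors, not only in aggregate.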

Citations

Identifying Systemic Bias in the Acquisition of Machine Learning Decision Aids for Law Enforcement Applications
TLDR
Current and planned software implementations of artificial intelligence and machine learning algorithms used by law enforcement and other agencies to aid in decision-making should be examined for potential bias.
Problems in the deployment of machine-learned models in health care
CMAJ, September 7, 2021, Volume 193, Issue 35, E1391. In a companion article, Verma and colleagues discuss how machine-learned solutions can be developed and implemented to support medical…
Presenting machine learning model information to clinical end users with model facts labels
TLDR
A systematic effort is presented to ensure that front-line clinicians actually know how, when, and when not to incorporate model output into clinical decisions.
Generalizability challenges of mortality risk prediction models: A retrospective analysis on a multi-center database
TLDR
Group-level performance of mortality prediction models varies significantly when applied to hospitals or geographies different from those in which they were developed, and a better understanding and documentation of the provenance of data and health processes are needed to identify and mitigate sources of variation.
LuPe: A System for Personalized and Transparent Data-driven Decisions
TLDR
This work demonstrates LuPe, a system that optimizes the choice of applied model for subgroups of the population or for individuals, thereby personalizing model choice to best fit users' profiles and improving fairness.
Evaluation Gaps in Machine Learning Practice
TLDR
The evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations are examined, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.
A comparison of approaches to improve worst-case predictive model performance over patient subpopulations
TLDR
A large-scale empirical study of distributionally robust optimization (DRO) and several variations of standard learning procedures is conducted to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations, compared to standard approaches for learning predictive models from electronic health records.
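The worst-case objective studied in such work can be sketched with a group-DRO-style update that upweights whichever subpopulation currently has the highest loss. The snippet below is a simplified illustration assuming per-example losses and group labels are available; it is not the study's exact training procedure:

```python
# Minimal sketch of a group-DRO-style update: upweight the group with the
# worst current loss via multiplicative weights. Illustration only.
import numpy as np

def group_dro_weights(losses, groups, step=0.1, q=None):
    """One multiplicative-weights update over per-group mean losses."""
    n_groups = groups.max() + 1
    if q is None:
        q = np.full(n_groups, 1.0 / n_groups)
    group_losses = np.array([losses[groups == g].mean() for g in range(n_groups)])
    q = q * np.exp(step * group_losses)   # raise weight on high-loss groups
    q /= q.sum()
    # The training objective then minimizes sum_g q[g] * group_losses[g].
    return q, group_losses

losses = np.array([0.2, 0.9, 0.4, 1.1, 0.3, 0.8])
groups = np.array([0, 1, 0, 1, 0, 1])
q, gl = group_dro_weights(losses, groups)
print(q, gl)  # group 1 (higher loss) receives more weight
```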
Explainable machine learning practices: opening another black box for reliable medical AI
TLDR
It is claimed that to regulate AI tools and evaluate their reliability, agencies need an explanation of how ML tools have been built, which requires documenting and justifying the technical choices that practitioners have made in designing such tools.
EFAR-MMLA: An Evaluation Framework to Assess and Report Generalizability of Machine Learning Models in MMLA
TLDR
This paper proposes an evaluation framework to assess and report the generalizability of ML models in multimodal learning analytics (EFAR-MMLA) and presents a case study with two datasets, each containing audio and log data collected from a classroom during a collaborative learning session, to illustrate the framework's usefulness.

References

Showing 1-10 of 40 references
The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards
TLDR
The Dataset Nutrition Label is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset "ingredients" before AI model development.
Measuring and Mitigating Unintended Bias in Text Classification
TLDR
A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
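The identity-term probing idea can be illustrated by scoring templated sentences that differ only in a demographic identity term and comparing the model's outputs. In the sketch below, `toxicity_score` is a hypothetical stand-in for any text classifier, and the templates and terms are illustrative:

```python
# Sketch of identity-term bias probing: score otherwise-identical neutral
# templates that differ only in an identity term, then compare outputs.
def toxicity_score(text: str) -> float:
    # Placeholder model: flags sentences containing "hate" (illustration only).
    return 1.0 if "hate" in text.lower() else 0.1

TEMPLATES = ["I am a {} person", "Being {} is normal"]
IDENTITY_TERMS = ["gay", "straight", "muslim", "christian"]

for term in IDENTITY_TERMS:
    scores = [toxicity_score(t.format(term)) for t in TEMPLATES]
    avg = sum(scores) / len(scores)
    print(f"{term}: mean score {avg:.2f}")
# Unintended bias shows up as systematic score gaps between identity terms
# on these neutral templates.
```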
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
TLDR
Concept Activation Vectors (CAVs) are introduced, which provide an interpretation of a neural net's internal state in terms of human-friendly concepts, and may be used to explore hypotheses and generate insights for a standard image classification network as well as a medical application.
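A rough sketch of the TCAV idea: learn a concept direction (the CAV) in a layer's activation space by separating concept examples from random ones, then measure how often the class gradient aligns with that direction. Here activations and gradients are assumed precomputed, with random data standing in for a real network:

```python
# Rough TCAV sketch: fit a linear separator between concept and random
# activations, take its normal vector as the CAV, then compute the fraction
# of inputs whose class-logit gradient has a positive directional derivative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
concept_acts = rng.normal(1.0, 1.0, size=(50, 16))  # activations for concept examples
random_acts = rng.normal(0.0, 1.0, size=(50, 16))   # activations for random examples

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 50 + [0] * 50)
cav = LogisticRegression().fit(X, y).coef_[0]       # concept activation vector

# Gradients of the class logit w.r.t. the same layer, one per input of interest.
grads = rng.normal(0.5, 1.0, size=(200, 16))
tcav_score = float(np.mean(grads @ cav > 0))        # sensitivity of the class to the concept
print(f"TCAV score: {tcav_score:.2f}")
```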
Did the Model Understand the Question?
TLDR
Analysis of state-of-the-art deep learning models for question answering on images, tables, and passages of text finds that these deep networks often ignore important question terms, and demonstrates that attributions can augment standard measures of accuracy and empower investigation of model performance.
Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement
TLDR
The nature of the prediction in diagnosis is estimating the probability that a specific outcome or disease is present (or absent) within an individual at this point in time, that is, the moment of prediction (T = 0), whereas prognostic prediction involves a longitudinal relationship.
Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets
TLDR
This paper presents a strategy that reduces the size of the majority class and generates synthetic samples for the minority class, and indicates that the classification performance of the proposed approach is better than that of existing resampling methods.
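The two-sided resampling strategy can be sketched as undersampling the majority class while synthesizing minority points by interpolating between minority neighbors, a SMOTE-style simplification rather than the paper's exact algorithm:

```python
# Sketch of two-sided resampling: shrink the majority class and synthesize
# minority examples on segments between random minority pairs (SMOTE-style
# simplification, not the paper's exact method).
import numpy as np

def rebalance(X_maj, X_min, rng=np.random.default_rng(0)):
    # Undersample the majority class down to at most twice the minority size.
    keep = rng.choice(len(X_maj), size=min(len(X_maj), 2 * len(X_min)), replace=False)
    X_maj = X_maj[keep]
    # Generate one synthetic point per minority example by interpolation.
    i = rng.integers(0, len(X_min), size=len(X_min))
    j = rng.integers(0, len(X_min), size=len(X_min))
    lam = rng.random((len(X_min), 1))
    X_syn = X_min[i] + lam * (X_min[j] - X_min[i])
    return X_maj, np.vstack([X_min, X_syn])

X_maj = np.random.randn(1000, 2)
X_min = np.random.randn(50, 2) + 3.0
X_maj2, X_min2 = rebalance(X_maj, X_min)
print(len(X_maj2), len(X_min2))  # 100 majority, 100 minority
```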
Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science
TLDR
It is argued that data statements will help alleviate issues related to exclusion and bias in language technology; lead to better precision in claims about how NLP research can generalize and thus better engineering results; protect companies from public embarrassment; and ultimately lead to language technology that meets its users in their own preferred linguistic style and furthermore does not misrepresent them to others.
Face Recognition Performance: Role of Demographic Information
TLDR
It is shown that an alternative to dynamic face matcher selection is to train face recognition algorithms on datasets that are evenly distributed across demographics, as this approach offers consistently high accuracy across all cohorts.
Equality of Opportunity in Supervised Learning
TLDR
This work proposes a criterion for discrimination against a specified sensitive attribute in supervised learning, where the goal is to predict some target based on available features and shows how to optimally adjust any learned predictor so as to remove discrimination according to this definition.
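The paper's core criterion, equality of opportunity, requires a predictor to have (approximately) equal true positive rates across groups defined by the sensitive attribute A, i.e. P(Ŷ=1 | A=a, Y=1) equal for all a. A minimal sketch of checking this on labeled predictions, with toy data standing in for a real model:

```python
# Sketch of an equality-of-opportunity check: compare true positive rates
# across groups defined by a sensitive attribute. Toy data for illustration.
import numpy as np

def true_positive_rates(y_true, y_pred, groups):
    return {g: y_pred[(groups == g) & (y_true == 1)].mean()
            for g in np.unique(groups)}

y_true = np.array([1, 1, 0, 1, 1, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(true_positive_rates(y_true, y_pred, groups))
# A gap between groups signals a violation; the paper shows how to
# post-process a learned predictor (e.g. per-group thresholds) to close it.
```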
Increasing Trust in AI Services through Supplier's Declarations of Conformity
TLDR
This paper envisions an SDoC for AI services that contains purpose, performance, safety, security, and provenance information, to be completed and voluntarily released by AI service providers for examination by consumers.