Corpus ID: 235390804

It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks

Michelle Bao, Angela Zhou, Samantha A. Zottola, Brian Brubach, Sarah L. Desmarais, Aaron Horowitz, Kristian Lum, Suresh Venkatasubramanian
Risk assessment instrument (RAI) datasets, particularly ProPublica’s COMPAS dataset, are commonly used in algorithmic fairness papers due to benchmarking practices of comparing algorithms on datasets used in prior work. In many cases, these data are used as a benchmark to demonstrate good performance without accounting for the complexities of criminal justice (CJ) processes. However, we show that pretrial RAI datasets can contain numerous measurement biases and errors, and due to disparities in… 

Tables from this paper

Predictability and Surprise in Large Generative Models

This paper highlights a counterintuitive property of large-scale generative models: they combine predictable loss on a broad training distribution with unpredictable specific capabilities, inputs, and outputs. It analyzes how these conflicting properties give model developers various motivations for deploying these models, and the challenges that can hinder deployment.

Cascaded Debiasing: Studying the Cumulative Effect of Multiple Fairness-Enhancing Interventions

The need for new fairness metrics that account for the impact on different population groups, beyond just the disparity between groups, is highlighted, and a list of combinations of interventions that perform best for different fairness and utility metrics is offered to aid the design of fair ML pipelines.

Fair Inference for Discrete Latent Variable Models

A fair stochastic variational inference technique for discrete latent variable models is developed by including a fairness penalty on the variational distribution that aims to respect the principles of intersectionality, a critical lens on fairness from the legal, social science, and humanities literature, and by optimizing the variational parameters under this penalty.

Ex-Ante Assessment of Discrimination in Dataset

FORESEE, a FOREst of deciSion trEEs algorithm, is proposed, which generates a score that captures how likely an individual’s response is to vary with sensitive attributes; it allows stakeholders to characterize risky samples that may contribute to discrimination, as well as to estimate the risk of upcoming samples.

A Novel Regularization Approach to Fair ML

A new approach, Explicitly Deweighted Features (EDF), reduces the impact of each feature among the proxies of sensitive variables, allowing a different amount of deweighting to be applied to each such feature.

GHC: U: Robustness of Fairness in Machine Learning

A framework for testing the robustness of a popular fairness metric is designed, finding that, compared to more traditional performance metrics, the fairness metric is more sensitive to fluctuations in the evaluation dataset across a variety of settings.

On the role of benchmarking data sets and simulations in method comparison studies

Differences and similarities between method comparison studies and simulation studies are investigated in order to discuss their advantages and disadvantages and to develop new approaches to method evaluation that pick the best of both worlds.

Data-Centric Factors in Algorithmic Fairness

A new dataset on recidivism, covering 1.5 million criminal cases from courts in the U.S. state of Wisconsin between 2000 and 2018, is introduced, and it is found that data-centric factors often do influence fairness metrics when the classifier specification is held constant, without having a corresponding effect on accuracy metrics.

More Data Can Lead Us Astray: Active Data Acquisition in the Presence of Label Bias

This work empirically shows that, when label bias is overlooked, collecting more data can aggravate bias, and that imposing fairness constraints that rely on the observed labels during data collection may not address the problem.

A Validity Perspective on Evaluating the Justified Use of Data-driven Decision-making Algorithms

This work lays the foundation for co-designing a validity protocol, in collaboration with real-world stakeholders, to critically evaluate the justifiability of specific designs and uses of data-driven algorithmic systems.

Assessing Risk Assessment in Action

Recent years have seen a rush toward evidence-based tools in criminal justice. As part of this movement, many jurisdictions have adopted actuarial risk assessment to supplement or replace the ad-hoc…

Reporting guidance for violence risk assessment predictive validity studies: the RAGEE Statement.

The present study aimed to develop the first set of reporting guidance for predictive validity studies of violence risk assessments: the RAGEE Statement, which has the potential to improve the quality of the risk assessment literature.

A Primer on Risk Assessment for Legal Decisionmakers

This primer is addressed to judges, parole board members, and other legal decisionmakers who use or are considering using the results of risk assessment instruments (RAIs) in making determinations

On the Validity of Arrest as a Proxy for Offense: Race and the Likelihood of Arrest for Violent Crimes

Bias in violent arrest data is investigated by analysing racial disparities in the likelihood of arrest for White and Black violent offenders from 16 US states as recorded in the National Incident Based Reporting System (NIBRS).

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

Surprisingly, it is found that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data.

Fairness Violations and Mitigation under Covariate Shift

An approach based on feature selection that exploits conditional independencies in the data to estimate accuracy and fairness metrics for the test set is specified, and it is shown that, for specific fairness definitions, the resulting model satisfies a form of worst-case optimality.

Closer than they appear: A Bayesian perspective on individual‐level heterogeneity in risk assessment

Risk assessment instruments are used across the criminal justice system to estimate the probability of some future event, such as failure to appear for a court appointment or re-arrest. The estimated…

The effect of differential victim crime reporting on predictive policing systems

It is demonstrated how differential victim crime reporting rates across geographical areas can lead to outcome disparities in common crime hot spot prediction models, which may lead to misallocations in the form of both over-policing and under-policing.

Algorithmic Fairness: Choices, Assumptions, and Definitions

It is shown how choices and assumptions made, often implicitly, to justify the use of prediction-based decision-making can raise fairness concerns, and a notationally consistent catalog of fairness definitions from the literature is presented.
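Two of the group-fairness definitions commonly catalogued in this literature can be made concrete with a minimal sketch. The function names and the tiny synthetic data below are illustrative assumptions, not taken from any of the papers above; the sketch computes the demographic parity difference (gap in selection rates between groups) and one component of equalized odds (gap in true positive rates).

```python
# Minimal sketch of two common group-fairness metrics.
# Names and data are illustrative, not from any specific paper.

def demographic_parity_diff(y_pred, group):
    """|P(yhat=1 | A=0) - P(yhat=1 | A=1)|: gap in selection rates."""
    rates = []
    for g in (0, 1):
        preds = [p for p, a in zip(y_pred, group) if a == g]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])

def tpr_diff(y_true, y_pred, group):
    """|TPR(A=0) - TPR(A=1)|: one component of equalized odds."""
    tprs = []
    for g in (0, 1):
        # Predictions for truly positive members of group g.
        pos = [p for p, t, a in zip(y_pred, y_true, group) if a == g and t == 1]
        tprs.append(sum(pos) / len(pos))
    return abs(tprs[0] - tprs[1])

# Tiny synthetic example: predictions, labels, and a binary group attribute.
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
y_true = [1, 0, 1, 0, 1, 0, 1, 1]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

print(demographic_parity_diff(y_pred, group))  # 0.5  (0.75 vs 0.25 selection rate)
print(tpr_diff(y_true, y_pred, group))         # ~0.667 (TPR 1.0 vs 1/3)
```

A classifier can satisfy one of these criteria while violating the other, which is part of why a consistent catalog of definitions matters.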