Retiring Adult: New Datasets for Fair Machine Learning
@article{Ding2021RetiringAN, title={Retiring Adult: New Datasets for Fair Machine Learning}, author={Frances Ding and Moritz Hardt and John Miller and Ludwig Schmidt}, journal={ArXiv}, year={2021}, volume={abs/2108.04884} }
Although the fairness community has recognized the importance of data, researchers in the area primarily rely on UCI Adult when it comes to tabular data. Derived from a 1994 US Census survey, this dataset has appeared in hundreds of research papers where it served as the basis for the development and comparison of many algorithmic fairness interventions. We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit…
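The reconstruction described in the abstract was released by the authors as the folktables Python package, which exposes the new Census-derived prediction tasks. The sketch below shows how one such task might be loaded; it follows the package's documented interface as best recalled, so argument names and defaults should be treated as assumptions rather than an exact API.

```python
# Minimal sketch of loading one of the new Census-derived tasks via the
# authors' folktables package (interface as best recalled; may differ by version).
from folktables import ACSDataSource, ACSIncome

# Pull 2018 1-Year American Community Survey person records for California.
data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["CA"], download=True)

# ACSIncome is the income-prediction task intended to replace UCI Adult;
# it yields features, a binary label, and a group (protected attribute) column.
features, label, group = ACSIncome.df_to_numpy(acs_data)
print(features.shape, label.mean())
```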
100 Citations
Tackling Documentation Debt: A Survey on Algorithmic Fairness Datasets
- Computer Science, EAAMO
- 2022
This work surveys over two hundred datasets employed in algorithmic fairness research, producing standardized and searchable documentation for each of them, and summarizes the merits and limitations of Adult, COMPAS, and German Credit, calling into question their suitability as general-purpose fairness benchmarks.
Algorithmic fairness datasets: the story so far
- Computer Science, Data Mining and Knowledge Discovery
- 2022
This work surveys over two hundred datasets employed in algorithmic fairness research, and produces standardized and searchable documentation for each of them, rigorously identifying the three most popular fairness datasets, namely Adult, COMPAS, and German Credit, for which this unifying documentation effort supports multiple contributions.
Data-Centric Factors in Algorithmic Fairness
- Computer Science, AIES
- 2022
A new dataset on recidivism, covering 1.5 million criminal cases from courts in the U.S. state of Wisconsin from 2000-2018, is introduced, and it is found that data-centric factors often do influence fairness metrics while holding the classifier specification constant, without having a corresponding effect on accuracy metrics.
Towards Intersectionality in Machine Learning: Including More Identities, Handling Underrepresentation, and Performing Evaluation
- Computer Science, FAccT
- 2022
This work grapples with questions that arise along three stages of the machine learning pipeline when incorporating intersectionality as multiple demographic attributes: which demographic attributes to include as dataset labels, how to handle the progressively smaller size of subgroups during model training, and how to move beyond existing evaluation metrics when benchmarking model fairness for more subgroups.
A survey on datasets for fairness‐aware machine learning
- Computer Science, WIREs Data Mining and Knowledge Discovery
- 2022
This paper overviews real-world datasets used for fairness-aware ML, identifying relationships between the different attributes, particularly with respect to the protected attributes and the class attribute, using a Bayesian network.
FLEA: Provably Fair Multisource Learning from Unreliable Training Data
- Computer Science, ArXiv
- 2021
FLEA is introduced, a filtering-based algorithm that allows the learning system to identify and suppress data sources that would have a negative impact on fairness or accuracy if they were used for training; it is proved formally that, given enough data, FLEA protects the learner against unreliable data.
Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation
- Computer Science, ArXiv
- 2022
BAF is presented, the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets, generated by applying state-of-the-art tabular data generation techniques to an anonymized, real-world bank account opening fraud detection dataset.
FLEA: Provably Robust Fair Multisource Learning from Unreliable Training Data
- Computer Science
- 2021
FLEA is not a replacement for prior fairness-aware learning methods but rather an augmentation that makes any of them robust against unreliable training data, and it is proved formally that, given enough data, FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half.
Achieving Downstream Fairness with Geometric Repair
- Computer Science, ArXiv
- 2022
It is argued that fairer classification outcomes can be produced through the development of setting-specific interventions, and it is shown that attaining distributional parity minimizes rate disparities across all thresholds in the up/downstream setting.
Subgroup Robustness Grows On Trees: An Empirical Baseline Investigation
- Computer Science, ArXiv
- 2022
This work suggests that tree-based ensemble models make an effective baseline for tabular data, and are a sensible default when subgroup robustness is desired, even when compared to robustness- and fairness-enhancing methods.
References
Showing 1-10 of 65 references
AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias
- Computer Science, IBM J. Res. Dev.
- 2019
A new open-source Python toolkit for algorithmic fairness, AI Fairness 360 (AIF360), released under an Apache v2.0 license, to help facilitate the transition of fairness research algorithms for use in an industrial setting and to provide a common framework for fairness researchers to share and evaluate algorithms.
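As a rough illustration of the toolkit described above, the sketch below loads UCI Adult through AIF360 and computes two common group-fairness metrics; class and method names follow the AIF360 documentation as best recalled, so the exact signatures should be treated as assumptions.

```python
# Hedged sketch of auditing UCI Adult with AIF360 (names as best recalled).
from aif360.datasets import AdultDataset
from aif360.metrics import BinaryLabelDatasetMetric

dataset = AdultDataset()  # loads UCI Adult with AIF360's default preprocessing

# 'sex' is one of the dataset's protected attributes; 1 marks the privileged group here.
metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=[{'sex': 1}],
    unprivileged_groups=[{'sex': 0}],
)

# Ratio and difference of favorable-outcome rates between the two groups.
print("Disparate impact:", metric.disparate_impact())
print("Statistical parity difference:", metric.mean_difference())
```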
Data and its (dis)contents: A survey of dataset development and use in machine learning research
- Computer Science, Patterns
- 2021
It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks
- Computer Science, NeurIPS Datasets and Benchmarks
- 2021
It is analyzed how current research and publication practices in algorithmic fairness can be ill-suited for meaningful engagement with fairness in criminal justice (CJ) applications and can exacerbate previously delineated issues with data quality, real-world relevance, and inadvertent normative implications.
Learning Fair Representations
- Computer Science, ICML
- 2013
We propose a learning algorithm for fair classification that achieves both group fairness (the proportion of members in a protected group receiving positive classification is identical to the…
Unbiased look at dataset bias
- Computer Science, CVPR 2011
- 2011
A comparison study using a set of popular datasets is presented, evaluated on a number of criteria including relative data bias, cross-dataset generalization, effects of the closed-world assumption, and sample value.
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
- Computer Science, NIPS
- 2016
This work empirically demonstrates that its algorithms significantly reduce gender bias in embeddings while preserving their useful properties, such as the ability to cluster related concepts and to solve analogy tasks.
Lessons from archives: strategies for collecting sociocultural data in machine learning
- Computer Science, FAT*
- 2020
It is argued that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures for sociocultural data collection.
How Copyright Law Can Fix Artificial Intelligence's Implicit Bias Problem
- Law
- 2017
As the use of artificial intelligence (AI) continues to spread, we have seen an increase in examples of AI systems reflecting or exacerbating societal bias, from racist facial recognition to sexist…
Equality of Opportunity in Supervised Learning
- Computer Science, NIPS
- 2016
This work proposes a criterion for discrimination against a specified sensitive attribute in supervised learning, where the goal is to predict some target based on available features, and shows how to optimally adjust any learned predictor so as to remove discrimination according to this definition.
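The criterion this reference proposes, often called equal opportunity, requires the true positive rate to be equal across groups defined by the sensitive attribute. The snippet below is a generic illustration of measuring that gap from predictions, labels, and a binary group indicator; it is not the authors' code, and the function name is hypothetical.

```python
import numpy as np

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute difference in true positive rates between two groups (0 and 1)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tprs = []
    for g in (0, 1):
        mask = (group == g) & (y_true == 1)   # actual positives within group g
        tprs.append(y_pred[mask].mean())      # fraction predicted positive among them
    return abs(tprs[0] - tprs[1])

# Example usage: a large gap indicates the predictor violates equal opportunity.
# gap = equal_opportunity_gap(y_test, model.predict(X_test), group_test)
```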
Towards fairer datasets: filtering and balancing the distribution of the people subtree in the ImageNet hierarchy
- Computer Science, FAT*
- 2020
This paper examines ImageNet, a large-scale ontology of images that has spurred the development of many modern computer vision methods, and considers three key factors within the person subtree of ImageNet that may lead to problematic behavior in downstream computer vision technology.