Datasheets for Datasets
@article{Gebru2018DatasheetsFD, title={Datasheets for Datasets}, author={Timnit Gebru and J. Morgenstern and Briana Vecchione and Jennifer Wortman Vaughan and H. Wallach and Hal Daum{\'e} and K. Crawford}, journal={ArXiv}, year={2018}, volume={abs/1803.09010} }
The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet that describes its operating characteristics, test results, recommended uses, and other information. By analogy, we propose that every dataset be accompanied with a datasheet…
Supplemental Code
The IdenProf dataset is a collection of images of identifiable professionals. It has been collected to enable the development of AI systems that can identify people and the nature of their jobs simply by looking at an image, as humans can.
233 Citations
MT-Adapted Datasheets for Datasets: Template and Repository
- Computer Science
- ArXiv
- 2020
- 3
- Highly Influenced
Data and its (dis)contents: A survey of dataset development and use in machine learning research
- Computer Science
- ArXiv
- 2020
- 6
Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure
- Computer Science
- FAccT
- 2021
- 2
A System Framework for Personalized and Transparent Data-Driven Decisions
- Computer Science
- CAiSE
- 2020
- 2
The Best of Both Worlds: Challenges in Linking Provenance and Explainability in Distributed Machine Learning
- Computer Science
- 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS)
- 2019
- 2
Accountable Data Analytics Start with Accountable Data: The LiQuID Metadata Model
- Computer Science
- ER Forum/Posters/Demos
- 2020
Dataset Reuse: Toward Translating Principles to Practice
- Computer Science, Medicine
- Patterns
- 2020
- Highly Influenced
Towards Standardization of Data Licenses: The Montreal Data License
- Computer Science, Mathematics
- ArXiv
- 2019
- 6
Pitfalls in Machine Learning Research: Reexamining the Development Cycle
- Computer Science, Mathematics
- ArXiv
- 2020
References
DataHub: Collaborative Data Science & Dataset Version Management at Scale
- Computer Science
- CIDR
- 2015
- 120
Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?
- Computer Science
- CHI
- 2019
- 146
Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use
- Computer Science
- Bull. IEEE Tech. Comm. Digit. Libr.
- 2016
- 32
The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards
- Computer Science
- ArXiv
- 2018
- 87
Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
- Computer Science
- Transactions of the Association for Computational Linguistics
- 2018
- 123
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
- Computer Science, Mathematics
- NIPS
- 2016
- 985
Increasing Trust in AI Services through Supplier's Declarations of Conformity
- Biology, Computer Science
- IBM J. Res. Dev.
- 2019
- 92