Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development

@article{Scheuerman2021DoDH,
  title={Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development},
  author={Morgan Klaus Scheuerman and Emily L. Denton and Alex Hanna},
  journal={Proceedings of the ACM on Human-Computer Interaction},
  year={2021},
  volume={5},
  pages={1--37}
}
Data is a crucial component of machine learning. The field is reliant on data to train, validate, and test models. With increased technical capabilities, machine learning research has boomed in both academic and industry settings, and one major focus has been on computer vision. Computer vision is a popular domain of machine learning increasingly pertinent to real-world applications, from facial recognition in policing to object detection for autonomous vehicles. Given computer vision's… 

Citations

Robots Enact Malignant Stereotypes

This paper finds that robots powered by large datasets and Dissolution Models that contain humans risk physically amplifying malignant stereotypes in general; and recommends that robot learning methods that physically manifest stereotypes or other harmful outcomes be paused, reworked, or even wound down when appropriate, until outcomes can be proven safe, effective, and just.

Documenting Data Production Processes: A Participatory Approach for Data Work

It is argued that a view of documentation as a boundary object, i.e., an object that can be used differently across organizations and teams but holds enough immutable content to maintain integrity, can be useful when designing documentation to retrieve heterogeneous, often distributed, contexts of data production.

Towards Transparency in Dermatology Image Datasets with Skin Tone Annotations by Experts, Crowds, and an Algorithm

It is demonstrated that algorithms based on ITA-FST are not reliable for annotating large-scale image datasets, but human-centered, crowd-based protocols can reliably add skin type transparency to dermatology datasets.

What People Think AI Should Infer From Faces

It is argued that participatory approaches contribute valuable insights for the development of ethical AI in an increasingly visual data culture, and that non-experts’ justifications underscore the normative complexity behind facial AI inference-making.

Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research

A study of how dataset usage patterns differ across machine learning subcommunities and over time from 2015 to 2020 finds increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets introduced by researchers situated within a small number of elite institutions.

Seeing like a driver: How workers repair, resist, and reinforce the platform's algorithmic visions

This article theorizes the relationship between two ways of “seeing” and organizing urban mobility markets: the abstract, algorithmic vision of the mobility platform and the experiential, relational…

Understanding Emerging Obfuscation Technologies in Visual Description Services for Blind and Low Vision People

The framework of interdependence is used to unpack and understand obfuscation in VDS, enabling the authors to complicate privacy concerns, uncover the labor of Blind and low vision people, and emphasize the importance of safeguards.

Disordering Datasets

This study investigates the data science practices and design narratives that underlie AI-mediated behavioral health through a situational analysis of three natural language processing (NLP) training datasets, and articulates the sensitizing concept of ordering datasets, which aims to productively trouble dominant logics of AI/ML applications in behavioral health.

Interrogating Data Work as a Community of Practice

Reporting on interviews with 19 civic workers who perform data work as their main task, this study identifies an atypical relationship between subject-domain experts (such as the authors' interviewees) and full members of the data work community.
...

References

Showing 1-10 of 127 references

Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

It is argued that data statements will help alleviate issues related to exclusion and bias in language technology, lead to better precision in claims about how natural language processing research can generalize and thus better engineering results, protect companies from public embarrassment, and ultimately lead to language technology that meets its users in their own preferred linguistic style.

Excavating AI: the politics of images in machine learning training sets

By looking at the politics of classification within machine learning systems, this article demonstrates why the automated interpretation of images is an inherently social and political project. We…

Documenting Computer Vision Datasets: An Invitation to Reflexive Data Practices

This paper identifies four key issues that hinder the documentation of image datasets and the effective retrieval of production contexts and proposes reflexivity, understood as a collective consideration of social and intellectual factors that lead to praxis, as a necessary precondition for documentation.

You Can't Sit With Us: Exclusionary Pedagogy in AI Ethics Education

It is claimed that the current AI ethics education space relies on a form of "exclusionary pedagogy," where ethics is distilled for computational approaches, but there is no deeper epistemological engagement with other ways of knowing that would benefit ethical thinking or an acknowledgement of the limitations of uni-vocal computational thinking.

“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI

This paper defines, identifies, and presents empirical evidence on Data Cascades—compounding events causing negative, downstream effects from data issues—triggered by conventional AI/ML practices that undervalue data quality.

For You, or For "You"?: Everyday LGBTQ+ Encounters with TikTok

Design Justice: Community-Led Practices to Build the World We Need

Design Justice: Community-Led Practices to Build the World We Need by Sasha Costanza-Chock questions how design can create social equity. As creator of the MIT Civic Media Collaborative Design Stu...

Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

A rigorous framework for dataset development transparency that supports decision-making and accountability is introduced, which uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle.

Good systems, bad data?: Interpretations of AI hype and failures

This work proposes a more complex infrastructural view of the tools, data, and operation of AI systems as necessary to the production of social good, and explores how representations of the successes and failures of these systems, even among experts, tend to valorize algorithmic analysis and locate fault in the quality of the data rather than in the implementation of systems.
...