Leveraging Machine Learning to Detect Data Curation Activities

Sara Lafia, Andrea K. Thomer, David Bleckley, Dharma Akmon, Libby Hemphill
2021 IEEE 17th International Conference on eScience (eScience)
This paper describes a machine learning approach for annotating and analyzing data curation work logs at ICPSR, a large social sciences data archive. The systems we studied track curation work and coordinate team decision-making at ICPSR. Archive staff use these systems to organize, prioritize, and document curation work done on datasets, making them promising resources for studying curation work and its impact on data reuse, especially in combination with data usage analytics. A key challenge…
1 Citation

An Insider’s Take on Data Curation: Context, Quality, and Efficiency
This commentary describes how context, quality, and efficiency guide data curation at the University of Michigan's Inter-university Consortium for Political and Social Research (ICPSR). These three…


The data archive as factory: Alienation and resistance of data processors
This article approaches data processing by combining scholarship on invisible labor in knowledge infrastructures with a Marxian framework, and it proposes a four-step framework to better value the social contribution of data workers beyond the archive.
Data Practices and Curation Vocabulary (DPCVocab): An empirically derived framework of scientific data practices and curatorial processes
The present article covers the DPCVocab development process and examines applications for mapping relationships across the 3 categories, identifying factors for projecting curation costs and important differences in curation requirements across disciplines.
brat: a Web-based Tool for NLP-Assisted Text Annotation
The brat rapid annotation tool (BRAT) is introduced: an intuitive web-based tool for text annotation supported by Natural Language Processing (NLP) technology. An evaluation of annotation assisted by semantic class disambiguation on a multicategory entity mention annotation task shows a 15% decrease in total annotation time.
Library cultures of data curation: Adventures in astronomy
This paper presents a study of two university libraries that partnered with the Sloan Digital Sky Survey (SDSS) collaboration to curate a significant astronomy data set, and offers lessons in understanding how libraries choose curation paths and how these choices influence possibilities for data reuse.
Two Computational Models for Analyzing Political Attention in Social Media
Two computational models that automatically distinguish topics in politicians' social media content are described: one supervised classifier and one unsupervised topic model. Both are effective, inexpensive computational tools for political communication and social media research.
A Discussion of Value Metrics for Data Repositories in Earth and Environmental Sciences
Representatives from a number of environmental and Earth science repositories evaluate approaches for assessing the costs and benefits of publishing scientific data in their repositories, identifying various metrics that repositories typically use to report on the impact and value of their data products and services, plus additional metrics that would be useful but are not typically measured.
Digital data archives as knowledge infrastructures: Mediating data sharing and reuse
DANS, the Data Archiving and Networked Services institute of The Netherlands, which manages 50+ years of data from the social sciences, humanities, and other domains, is studied, revealing that a few large contributors provide a steady flow of content, but most are academic researchers who submit data sets infrequently and often restrict access to their files.
Proper Attribution for Curation and Maintenance of Research Collections: Metadata Recommendations of the RDA/TDWG Working Group
Recommendations for the representation of attribution metadata include the use of PROV entities and properties to link people, the curatorial actions they perform, and the digital or physical objects they are curating.
Virtuous and vicious circles in the data life-cycle
Data Cleaners for Pristine Datasets: Visibility and Invisibility of Data Processors in Social Science
  • J. Plantin
  • Computer Science
  • Science, Technology, & Human Values
  • 2018
The work of processors who curate and “clean” the data sets that researchers submit to data archives for archiving and further dissemination is investigated, showing that the organization of data processing stems directly from the archive's conception of a valid data set.