Advancing Computational Reproducibility in the Dataverse Data Repository Platform

  • Ana Trisovic, Philip Durbin, Tania Schlatter, Gustavo Durand, Sonia Barbosa, Danny Brooke, Mercè Crosas
  • Published 6 May 2020
  • Computer Science, Biology
  • Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems
Recent reproducibility case studies have raised concerns, showing that much of the deposited research is not reproducible. One of their conclusions was that the way data repositories store research data and code cannot fully facilitate reproducibility, because they lack the runtime environment needed for code execution. New specialized reproducibility tools provide cloud-based computational environments for code encapsulation, thus enabling research portability and reproducibility… 


A large-scale study on research code quality and execution
A study of the quality and execution of research code from publicly available replication datasets at the Harvard Dataverse repository finds that 74% of R files failed to complete without error in the initial execution, while 56% failed after code cleaning was applied, showing that many errors can be prevented with good coding practices.
Designing a Service for Compliant Sharing of Sensitive Research Data
It is argued that a decentralized service maintaining metadata, a global view of all data usage, and active policies, in combination with local monitoring and security enforcement, can provide automated compliance checking and share sensitive data with a broader community rather than limiting access to only core project members.
Opportunities and Challenges in Democratizing Immunology Datasets
The opportunities and challenges in democratizing datasets, repositories, and community-wide knowledge sharing tools are reviewed and use cases for repurposing open-access immunology datasets with advanced machine learning applications are presented.
Repository Approaches to Improving the Quality of Shared Data and Code
A combination of original and secondary data analysis studies is presented, focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code.
Reproducible Model Sharing for AI Practitioners
  • Amin Moradi, Alexandru Uta
  • Computer Science
    Proceedings of the Fifth Workshop on Distributed Infrastructures for Deep Learning (DIDL) 2021
  • 2021
The case for transparent and seamless model sharing is made to enable ease of review and reproducibility for ML practitioners, and a platform is designed and implemented that enables practitioners to deploy trained models and create easy-to-use inference environments.
Toward Reusable Science with Readable Code and Reproducibility
This paper proposes an open-source platform named RE3 that helps improve the reproducibility and readability of research projects involving R code. It incorporates a code-readability assessment based on a machine learning model trained on a code readability survey, and an automatic containerization service that executes code files and warns users of reproducibility errors.


Computing Environments for Reproducibility: Capturing the "Whole Tale"
Reproducible Containers
DetTrace is used to achieve, in an automatic fashion, reproducibility for 12,130 Debian package builds containing over 800 million lines of code, as well as bioinformatics and machine learning workflows, and it is shown that, while software in each of these domains is initially irreproducible, DetTrace brings reproducibility without requiring any hardware, OS, or application changes.
A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks
To understand good and bad practices used in the development of real notebooks, 1.4 million notebooks from GitHub are studied and a detailed analysis of their characteristics that impact reproducibility is presented.
Computational Reproducibility via Containers in Psychology
Scientific progress relies on the replication and reuse of research. Recent studies suggest, however, that sharing code and data does not suffice for computational reproducibility…
Implementing Computational Reproducibility in the Whole Tale Environment
The Tale emerges from the NSF-funded Whole Tale project, which is developing a computational environment designed to capture the entire computational pipeline associated with a scientific experiment and thereby enable computational reproducibility.
Implications for Nutrition Research from the National Academies' Report Reproducibility and Replicability in Science (P13-039-19).
Objectives: To present the findings, conclusions, and recommendations of the National Academies report Reproducibility and Replicability in Science, as relevant to the nutrition research community.
Introducing eLife's first computationally reproducible article
  • eLife Labs [Internet]
  • 2019