Principles for data analysis workflows

@article{Stoudt2021PrinciplesFD,
  title={Principles for data analysis workflows},
  author={Sara Stoudt and V{\'a}leri N V{\'a}squez and Ciera C. Martinez},
  journal={PLoS Computational Biology},
  year={2021},
  volume={17}
}
A systematic and reproducible “workflow”—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are… 
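The paper's three phases (Explore, Refine, Produce) can be mirrored in how a project is laid out on disk. The sketch below is purely illustrative; the directory and file names are my own, not prescribed by the authors:

```shell
# Illustrative skeleton for a phase-oriented analysis project
# (names are hypothetical, not taken from the paper)
mkdir -p project/data/raw project/explore project/refine project/produce
touch project/explore/01-eda.ipynb      # Explore: quick iteration, audience = yourself
touch project/refine/clean_data.py      # Refine: tidied code, audience = collaborators
touch project/produce/manuscript.md     # Produce: polished output, audience = the public
ls project
```

Separating the phases this way makes it easier to decide what must be cleaned and documented (refine/, produce/) versus what can stay exploratory (explore/).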

Orienting, Framing, Bridging, Magic, and Counseling: How Data Scientists Navigate the Outer Loop of Client Collaborations in Industry and Academia

This novel outer-loop workflow contributes to CSCW by expanding the notion of what collaboration means in data science beyond the widely-known inner-loop technical workflow stages of acquiring, cleaning, analyzing, modeling, and visualizing data.

A Multi-omics Data Analysis Workflow Packaged as a FAIR Digital Object

A joint activity of X-omics and the Netherlands Twin Register demonstrating the FAIRification of a multi-omics data set and the development of a FAIR multi-omics data analysis workflow.

Sharing and Caring: Creating a Culture of Constructive Criticism in Computational Legal Studies

We introduce seven foundational principles for creating a culture of constructive criticism in computational legal studies. Beginning by challenging the current perception of papers as the primary

Ten simple rules on writing clean and reliable open-source scientific software

Ten “rules” centered on two best-practice components, clean code and testing, are proposed; these can help improve the correctness, quality, usability, and maintainability of open-source scientific software.

Promoting Open Science Through Research Data Management

Describing data management as an integral part of a research process or workflow may help contextualize the importance of related resources, practices, and concepts for researchers who may be less familiar with them.

A Guide to Using GitHub for Developing and Versioning Data Standards and Reporting Formats

Data standardization combined with descriptive metadata facilitate data reuse, which is the ultimate goal of the Findable, Accessible, Interoperable, and Reusable (FAIR) principles. Community data or

Introducing Reproducibility to Citation Analysis: a Case Study in the Earth Sciences

This study replicated the prior citation study's conclusions, and also adapted the author’s methods to analyze the citation practices of Earth Scientists at four institutions, and found that 80% of the citations could be accounted for by only 7.88% of journals.

A Hydrologist’s Guide to Open Science

To have lasting impact on the scientific community and broader society, hydrologic research must be open, accessible, reusable, and reproducible. With so many different perspectives on and

References

Showing 1–10 of 94 references

Software engineering for scientific big data analysis

A set of 10 guidelines is provided to steer the creation of command-line computational tools that are usable, reliable, extensible, and in line with modern coding practices.

Streamlining data-intensive biology with workflow systems

The main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.
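As a concrete illustration of the organized workflow practices such systems encourage, here is a minimal Snakemake-style rule. This is a hedged sketch, not an example from the paper; the file paths and the `clean.py` script are hypothetical:

```python
# Minimal Snakefile fragment (Snakemake's Python-based DSL; illustrative only)
rule clean:
    input:
        "data/raw/measurements.csv"     # hypothetical raw input
    output:
        "data/clean/measurements.csv"   # derived file, rebuilt when input changes
    shell:
        "python scripts/clean.py {input} {output}"
```

Declaring inputs and outputs this way lets the workflow engine track dependencies and re-run only the steps whose inputs have changed, which is one mechanism behind the reproducibility benefits the entry describes.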

Good enough practices in scientific computing

A set of good computing practices that every researcher can adopt, regardless of their current level of computational skill are presented, which encompass data management, programming, collaborating with colleagues, organizing projects, tracking work, and writing manuscripts.

Publishing computational research - a review of infrastructures for reproducible and transparent scholarly communication

The applications reviewed support authors in publishing reproducible research, predominantly through literate programming, and offer deployment options and features that help authors create, and readers study, executable papers.

Enhancing reproducibility for computational methods

A novel set of Reproducibility Enhancement Principles (REP) targeting disclosure challenges involving computation is presented, which build upon more general proposals from the Transparency and Openness Promotion guidelines and emerged from workshop discussions among funding agencies, publishers and journal editors, industry participants, and researchers representing a broad range of domains.

An empirical analysis of journal policy effectiveness for computational reproducibility

This work evaluates the effectiveness of journal policies requiring that the data and code necessary for reproducibility be made available post-publication by the authors upon request, and finds such policies to be an improvement over no policy, but currently insufficient for reproducibility.

Opening the Publication Process with Executable Research Compendia

It is concluded that ERCs provide a novel potential to find, explore, reuse, and archive computer-based research.

Statistical Analyses and Reproducible Research

This article describes a software framework for both authoring and distributing integrated, dynamic documents that contain text, code, data, and any auxiliary content needed to recreate the computations in data analyses, methodological descriptions, simulations, and so on.

Next-generation sequencing data interpretation: enhancing reproducibility and accessibility

Currently pressing issues with analysis, interpretation, reproducibility and accessibility of next-generation sequencing data are discussed, and promising solutions are presented and potential future developments are explored.

Nextflow enables reproducible computational workflows

Nextflow is a workflow framework that combines a dataflow programming model with software containers to enable scalable, portable, and reproducible computational pipelines. (The summary originally shown here described Toil, a different workflow engine, and has been corrected to match the cited paper.)
...