Nextflow enables reproducible computational workflows
Paolo Di Tommaso, Maria Chatzou, Evan W. Floden, Pablo Prieto Barja, Emilio Palumbo and Cédric Notredame. Nature Biotechnology.
Toil enables reproducible, open source, big biomedical data analyses
Reproducibility means that results can be regenerated using the original computation's set of tools and parameters. Running the original TCGA best-practices RNA-seq pipeline with one sample per node would have cost ~$800,000; through the use of efficient algorithms (STAR and Kallisto) and Toil, the final cost was reduced to $26,071 (Supplementary Note 9). The utility of Toil is demonstrated by creating one of the single largest, consistently analyzed, public human RNA…
Scalable Workflows and Reproducible Data Analysis for Genomics.
This chapter shows how any researcher can create a reusable, reproducible bioinformatics pipeline that can be deployed and run anywhere, and how to build a scalable, reusable, and shareable workflow using four different workflow engines: the Common Workflow Language (CWL), the Guix Workflow Language (GWL), Snakemake, and Nextflow.
GenPipes: an open-source framework for distributed and scalable genomic analyses
GenPipes is a flexible Python-based framework that facilitates the development and deployment of multi-step workflows optimized for High Performance Computing clusters and the cloud, and offers genomic researchers a simple method to analyze different types of data.
Transcriptome annotation in the cloud: complexity, best practices, and cost.
It is demonstrated that public cloud providers are a practical alternative for the execution of advanced computational biology experiments at low cost and the choice of cloud platform is not dependent on the workflow but, rather, on the specific details and requirements of the cloud provider.
CGAT-core: a python framework for building scalable, reproducible computational biology workflows
This work has developed CGAT-core, a python package for the rapid construction of complex computational workflows, which seamlessly handles parallelisation across high performance computing clusters, integration of Conda environments, full parameterisation, database integration and logging.
Scalable Systems and Algorithms for Genomic Variant Analysis
This dissertation describes the ADAM system for processing large genomic datasets using distributed computing and implements an end-to-end variant calling pipeline using ADAM's APIs, which provides state-of-the-art SNV calling accuracy along with high (97%) INDEL calling accuracy.
ZARP: An automated workflow for processing of RNA-seq data
ZARP is a general-purpose RNA-seq analysis workflow that builds on state-of-the-art software in the field to facilitate the analysis of RNA-seq data sets; it is built using modern technologies with the ultimate goal of reducing hands-on time for bioinformaticians and non-expert users.
Bioinformatics recipes: creating, executing and distributing reproducible data analysis workflows
A web application is developed that allows researchers to publish and execute data analysis scripts, so that bioinformaticians can deploy data analysis workflows (recipes) that their collaborators can execute via point-and-click interfaces.
BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments
This work presents BioWorkbench, a framework for managing and analyzing bioinformatics experiments, and shows that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%.
Bioinformatics Pipeline using JUDI: Just Do It
JUDI is developed on top of a Python based WMS, DoIt, for a systematic handling of pipeline parameter settings based on the principles of DBMS that simplifies plug-and-play scripting.
xGAP: a python based efficient, modular, extensible and fault tolerant genomic analysis pipeline for variant discovery
MOTIVATION Since the first human genome was sequenced in 2001, there has been rapid growth in the number of bioinformatic methods to process and analyze next-generation sequencing (NGS) data…


Companion: a web server for annotation and analysis of parasite genomes
The Companion web server is developed, providing parasite genome annotation as a service using a reference-based approach; its use and performance are demonstrated by annotating two Leishmania and Plasmodium genomes as typical parasite cases and comparing the results to manually annotated references.
Interactive Viewing and Comparison of Large Phylogenetic Trees on the Web
Phylo.io, a web application to visualize and compare phylogenetic trees side-by-side, is introduced. Its distinctive features are: highlighting of similarities and differences between two trees, automatic identification of the best matching rooting and leaf order, scalability to large trees, high usability, multiplatform support via a standard HTML5 implementation, and the possibility to store and share visualizations.
Snakemake - a scalable bioinformatics workflow engine
SUMMARY Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow.
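The rule-based model that engines like Snakemake formalize can be illustrated with a small, self-contained Python sketch (this is a didactic toy, not Snakemake's actual syntax or API): each rule names its inputs and an action, and a target is produced by recursively satisfying its inputs first, yielding a topological build order over the dependency graph.

```python
# Toy illustration of rule-based workflow execution: each "rule" declares
# its inputs and an action; building a target first builds its inputs.
# Rule names and actions are invented for illustration.

RULES = {
    "counts.txt": {"inputs": ["aligned.txt"], "action": lambda: "count reads"},
    "aligned.txt": {"inputs": ["reads.txt"], "action": lambda: "align reads"},
    "reads.txt": {"inputs": [], "action": lambda: "fetch raw reads"},
}

def build(target, done=None):
    """Return the order in which targets are built to produce `target`."""
    if done is None:
        done = []
    for dep in RULES[target]["inputs"]:
        build(dep, done)
    if target not in done:          # run each rule at most once
        RULES[target]["action"]()
        done.append(target)
    return done

print(build("counts.txt"))  # ['reads.txt', 'aligned.txt', 'counts.txt']
```

Real engines add what this sketch omits: wildcard-based pattern rules, re-execution only when inputs are newer than outputs, and dispatch of independent rules in parallel across cores or cluster nodes.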
ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data
The Environment for Tree Exploration v3 is presented, featuring numerous improvements in the underlying library of methods, and providing a novel set of standalone tools to perform common tasks in comparative genomics and phylogenetics.
Democratic databases: science on GitHub
GitHub, a hugely popular website for collaborative work on software code, is an increasingly popular site for researchers to share, maintain and update scientific data sets and code; it is "the biggest revelation in my workflow", says Daniel Falster, a postdoctoral researcher in ecology.
AlgoRun: a Docker-based packaging system for platform-agnostic implemented algorithms
AlgoRun, a dedicated packaging system for implemented algorithms, using Docker technology, addresses the growing need in bioinformatics for easy-to-use software implementations of algorithms that are usable across platforms.
The impact of Docker containers on the performance of genomic pipelines
It is concluded that Docker containers have only a minor impact on the performance of common genomic pipelines, which is negligible when the executed jobs are long in terms of computational time.
Tools and techniques for computational reproducibility
No single strategy is sufficient for every scenario; thus it is often useful to combine approaches, and seven such strategies are described.
Near-optimal probabilistic RNA-seq quantification
Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases, which removes a major computational bottleneck in RNA-seq analysis.
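The pseudoalignment idea can be sketched in a few lines of Python: instead of aligning individual bases, each read's k-mers are looked up in a k-mer-to-transcript index, and the intersection of the resulting transcript sets gives the read's compatibility class. The transcript names and sequences below are invented for illustration, and this is a minimal sketch of the concept, not Kallisto's actual implementation (which uses a transcriptome de Bruijn graph and many optimizations).

```python
# Toy pseudoalignment: map each read to the set of transcripts
# compatible with all of its k-mers, without base-level alignment.

def kmers(seq, k=5):
    """All k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Hypothetical mini-transcriptome
transcripts = {
    "tx1": "ACGTACGTGGA",
    "tx2": "ACGTACGTTTT",
    "tx3": "GGGGGGGGGGG",
}

# Index: k-mer -> set of transcripts containing it
index = {}
for name, seq in transcripts.items():
    for km in kmers(seq):
        index.setdefault(km, set()).add(name)

def pseudoalign(read, k=5):
    """Intersect the transcript sets of the read's k-mers."""
    compatible = None
    for km in kmers(read, k):
        hits = index.get(km, set())
        compatible = hits if compatible is None else compatible & hits
    return compatible or set()

print(pseudoalign("ACGTACGT"))  # {'tx1', 'tx2'}: read matches the shared prefix
```

Because lookups are set intersections over a precomputed index, this step is fast; downstream, the compatibility classes feed a probabilistic model that resolves reads shared among transcripts into abundance estimates.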
Differential analysis of gene regulation at transcript resolution with RNA-seq
Cuffdiff 2, an algorithm that estimates expression at transcript-level resolution and controls for variability evident across replicate libraries, robustly identifies differentially expressed transcripts and genes and reveals differential splicing and promoter-preference changes.