Improving data workflow systems with cloud services and use of open data for bioinformatics research

@article{Karim2018ImprovingDW,
  title={Improving data workflow systems with cloud services and use of open data for bioinformatics research},
  author={Md. Rezaul Karim and Audrey M. Michel and Achille Zappa and P. Baranov and Ratnesh Sahay and Dietrich Rebholz-Schuhmann},
  journal={Briefings in Bioinformatics},
  year={2018},
  volume={19},
  pages={1035--1050}
}
Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud…
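As a loose illustration of what the abstract describes, the sketch below composes a data-access step with downstream analytics steps into a small dependency-resolved workflow. The step names and the S3-style path are hypothetical, and this is not the architecture of any particular DWFS.

```python
# Minimal sketch of a data workflow as a composition of steps
# (hypothetical names; not code from the paper or any specific DWFS).
from typing import Callable, Dict, List

class Step:
    """A named processing step with explicit upstream dependencies."""
    def __init__(self, name: str, func: Callable[[dict], dict], deps: List[str] = ()):
        self.name, self.func, self.deps = name, func, list(deps)

def run(steps: Dict[str, Step], target: str, ctx: dict) -> dict:
    """Resolve dependencies depth-first, then execute the target step."""
    for dep in steps[target].deps:
        ctx = run(steps, dep, ctx)
    return steps[target].func(ctx)

steps = {
    "fetch":  Step("fetch",  lambda c: {**c, "reads": "s3://bucket/sample.fastq"}),      # data access
    "align":  Step("align",  lambda c: {**c, "bam": c["reads"] + ".bam"}, ["fetch"]),    # analytics
    "report": Step("report", lambda c: {**c, "out": "summary.txt"}, ["align"]),          # sharing
}
print(run(steps, "report", {}))
```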

Citations

Towards Reproducible Bioinformatics: The OpenBio-C Scientific Workflow Environment

TLDR
The insufficiencies of current workflow editing and execution environments are explored, and key directions for advancing toward an environment that supports reproducibility are stated.

Laniakea: an open solution to provide Galaxy “on-demand” instances over heterogeneous cloud infrastructures

TLDR
Laniakea is presented, a robust and feature-rich software suite which can be deployed on any scientific or commercial Cloud infrastructure in order to provide a “Galaxy on demand” Platform as a Service (PaaS).

On Distributed Collaboration for Biomedical Analyses

TLDR
This article motivates the need for truly distributed biomedical analyses in the context of several ongoing projects, including the ICAN project involving 34 French hospitals and affiliated research groups, and presents a set of distributed architectures for such analyses that address scalability, security/privacy and reproducibility.

Experimenting with reproducibility in bioinformatics

TLDR
A case study of an attempt to reproduce a promising bioinformatics method, describing the challenges of using a published method for which code and data were available, with proposed solutions to improve reproducibility and research efficiency at the individual and collective level.

doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows

TLDR
Doepipeline is presented, a novel approach to optimizing bioinformatic software parameters, based on core concepts of the Design of Experiments methodology and recent advances in subset designs, which provides a systematic and robust framework for optimizing software parameter settings, in contrast to labor- and time-intensive manual parameter tweaking.
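To make the Design-of-Experiments idea concrete, here is a minimal sketch of a two-level factorial screen over two pipeline parameters. The scoring function and parameter ranges are invented for illustration, and this is not doepipeline's actual API.

```python
# Sketch of a two-level full-factorial screen over pipeline parameters,
# in the spirit of Design of Experiments (not doepipeline's actual API).
from itertools import product

def pipeline_score(min_qual: int, kmer: int) -> float:
    """Stand-in for running the pipeline and scoring its output
    (hypothetical response surface for illustration)."""
    return -(min_qual - 25) ** 2 - (kmer - 35) ** 2

# Low/high levels per factor; a full factorial evaluates every corner.
levels = {"min_qual": (10, 30), "kmer": (21, 41)}
design = [dict(zip(levels, combo)) for combo in product(*levels.values())]

best = max(design, key=lambda p: pipeline_score(**p))
print("best corner:", best)
# A real DoE run would fit a response-surface model to these corner
# results and iteratively shrink the ranges around the optimum.
```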

A Modeling Approach for Bioinformatics Workflows

TLDR
The Unified Modeling Language (UML) Activity Diagram is extended to the bioinformatics domain by including domain-specific concepts and notations, and a template is created to document the same concepts in a text format.

A UML Activity Diagram Extension and Template for Bioinformatics Workflows: A Design Science Study

TLDR
This thesis aims to extend the Unified Modelling Language (UML) activity diagram (AD) to the bioinformatics domain by including domain-specific and understandable concepts and notations, and creates a template to document the same concepts in a written format.

Implementing the FAIR Data Principles in precision oncology: review of supporting initiatives

TLDR
A systematic literature review of potentially relevant initiatives in precision oncology data sharing is presented, together with a centralized solution for participating in data sharing through federated efforts such as the Beacon Networks.

Constructing a Quantitative Fusion Layer over the Semantic Level for Scalable Inference

TLDR
A methodology and a corresponding system are presented to bridge the gap between prioritization tools with a fixed target and unrestricted semantic queries, together with a step-by-step guide to the methodology using a macular degeneration model covering drug, target and disease domains.

References

Showing 1–10 of 114 references

Experiences with workflows for automating data-intensive bioinformatics

TLDR
This paper presents experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead, and defines a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.

Rethinking data management for big data scientific workflows

TLDR
This paper presents two general approaches: one exclusively uses object stores to hold all the files accessed and generated by a workflow, while the other relies on the shared filesystem for caching intermediate data sets.
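A minimal sketch of the two staging strategies, using local directories as stand-ins for an object store and a shared filesystem (all names hypothetical; this is not the paper's implementation):

```python
# Sketch contrasting the two staging strategies: every file round-trips
# through an object store vs. intermediates cached on a shared filesystem.
import shutil
from pathlib import Path

class ObjectStoreStaging:
    """Every file a task reads or writes is copied to/from the store."""
    def __init__(self, bucket_dir: Path):
        self.bucket = bucket_dir  # stand-in for an S3/Swift bucket

    def stage_in(self, key: str, workdir: Path) -> Path:
        dst = workdir / key
        shutil.copy(self.bucket / key, dst)  # download before the task runs
        return dst

    def stage_out(self, path: Path, key: str) -> None:
        shutil.copy(path, self.bucket / key)  # upload after the task runs

class SharedFSStaging:
    """Intermediates stay on a shared filesystem; no copies are needed."""
    def __init__(self, shared_dir: Path):
        self.shared = shared_dir

    def stage_in(self, key: str, workdir: Path) -> Path:
        return self.shared / key  # tasks read in place

    def stage_out(self, path: Path, key: str) -> None:
        pass  # output is already on the shared filesystem
```

The trade-off the paper studies falls out of this shape: the object-store variant pays a transfer cost per task but needs no shared storage, while the shared-filesystem variant avoids copies at the price of requiring one.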

SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision

TLDR
Tests of SparkSeq demonstrate its scalability and overall fast performance on analyses of sequencing datasets, and show that caching and the HDFS block size can be tuned for optimal performance on multiple worker nodes.
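The caching pattern this refers to can be sketched in plain PySpark (hypothetical HDFS path; SparkSeq's own API is not used here):

```python
# Sketch of the SparkSeq-style pattern: load reads from HDFS, cache the
# RDD, and run interactive queries over it.
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="coverage-sketch")

# Lines of a SAM file; '@' lines are headers, field 4 is the position.
reads = (sc.textFile("hdfs:///data/sample.sam")
           .filter(lambda line: not line.startswith("@"))
           .cache())  # caching is what makes repeated queries cheap

coverage = (reads.map(lambda line: (line.split("\t")[3], 1))
                 .reduceByKey(add))
print(coverage.take(5))
```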

Open source workflow systems in life sciences informatics

TLDR
Although SWMSs, including open source ones, still have several open issues, their unique features and strong momentum clearly suggest that it is only a matter of time before they are adopted in even more scientific fields.

Parallelization in Scientific Workflow Management Systems

TLDR
The survey gives an overview of parallelization techniques for SWfMS, both in theory and in their realization in concrete systems, finds that current systems leave considerable room for improvement, and proposes key advancements to the landscape of SWfMS.
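As a toy illustration of one commonly surveyed technique, data parallelism across independent tasks, the following runs per-sample steps concurrently before a downstream merge (task names invented for illustration):

```python
# Toy sketch of parallelism in a workflow: steps with no mutual
# dependencies run concurrently, then a downstream step consumes both.
from concurrent.futures import ThreadPoolExecutor

def align(sample: str) -> str:       # independent per-sample work
    return f"{sample}.bam"

def merge(bams: list[str]) -> str:   # downstream step acting as a barrier
    return "merged.bam from " + ", ".join(bams)

samples = ["s1", "s2", "s3"]
with ThreadPoolExecutor() as pool:
    bams = list(pool.map(align, samples))  # data parallelism across samples
print(merge(bams))
```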

Scientific workflow systems - can one size fit all?

  • V. Curcin, M. Ghanem
  • Computer Science
  • 2008 Cairo International Biomedical Engineering Conference
  • 2008
TLDR
This paper provides a high-level framework for comparing the systems based on their control flow and data flow properties with a view of both informing future research in the area by academic researchers and facilitating the selection of the most appropriate system for a specific application task by practitioners.

Tavaxy: Integrating Taverna and Galaxy workflows with cloud computing support

TLDR
Tavaxy shortens the workflow development cycle by introducing workflow patterns that simplify workflow creation, enables the re-use and integration of existing (sub-)workflows from Taverna and Galaxy, and allows the creation of hybrid workflows.

Distributed workflow-driven analysis of large-scale biological data using biokepler

TLDR
The challenges related to next-generation sequencing data are discussed and the approaches taken in bioKepler to help with analysis of such data are explained.

Kepler: an extensible system for design and execution of scientific workflows

TLDR
The Kepler scientific workflow system provides domain scientists with an easy-to-use yet powerful system for capturing scientific workflows (SWFs), a formalization of the ad-hoc process that a scientist may go through to get from raw data to publishable results.

The RDF Pipeline Framework: Automating Distributed, Dependency-Driven Data Pipelines

TLDR
This paper explains how distributed healthcare data production processes can be conveniently defined in RDF as executable dependency graphs, using the RDF Pipeline Framework, thus avoiding unnecessary data regeneration.
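The "avoid unnecessary regeneration" idea is essentially a make-style freshness check; below is a minimal sketch using file modification times (file names hypothetical; the framework itself expresses these dependencies as RDF graphs rather than mtimes):

```python
# Minimal make-style freshness check in the spirit of dependency-driven
# pipelines: regenerate an output only if an input is newer than it.
import os

def stale(output: str, inputs: list[str]) -> bool:
    """True if the output is missing or older than any input."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)

if stale("patients_clean.csv", ["patients_raw.csv"]):
    print("regenerating patients_clean.csv")  # run the real transform here
else:
    print("up to date; skipping regeneration")
```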
...