Corpus ID: 11552348

Parallelizing bioinformatics applications with MapReduce

@inproceedings{Gaggero2008ParallelizingBA,
  title={Parallelizing bioinformatics applications with MapReduce},
  author={Massimo Gaggero and Simone Leo and Simone Manca and Federico Andrea Santoni and Omar Schiaratura and Gianluigi Zanetti},
  year={2008}
}
Current bioinformatics applications require both management of huge amounts of data and heavy computation: fulfilling these requirements calls for simple ways to implement parallel computing. MapReduce is a general-purpose parallelization technology that appears to be particularly well adapted to this task. Here we report on its application, using its open source implementation Hadoop, to two relevant algorithms: BLAST and GSEA. The first is characterized by streaming computation on large data…

Survey of MapReduce frame operation in bioinformatics
TLDR
This article presents MapReduce frame-based applications that can be employed in next-generation sequencing and other biological domains, and discusses the challenges faced by this field as well as future work on parallel computing in bioinformatics.
Consensus Sigma-70 Promoter Prediction Using Hadoop
TLDR
This work examines the application of Hadoop to patterns of this nature, using as its focus a well established workflow for identifying promoters - binding sites for regulatory proteins - across multiple gene regions and organisms, coupled with the unifying step of assembling these results into a consensus sequence.
Kafka interfaces for composable streaming genomics pipelines
TLDR
This work decomposes the first steps of genomic processing into two distinct and specialized modules (preprocessing and alignment) and loosely composes them via communication through Kafka streams, in order to allow for easy composability and integration into existing Hadoop-based pipelines.
MapReducing a genomic sequencing workflow
TLDR
This work presents a MapReduce workflow that harnesses Hadoop to post-process the data produced by deep sequencing machines, and shows that it provides a scalable solution with a significantly improved throughput over its predecessor.
Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends
TLDR
The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools.
Hadoop Applications in Bioinformatics
TLDR
Hadoop-based applications employed in bioinformatics, covering next-generation sequencing and other biological domains, are presented, and obstacles and future work for Hadoop in bioinformatics are discussed.
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
TLDR
Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing, due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase.
The Role of Distributed Computing in Big Data Science: Case Studies in Forensics and Bioinformatics - Abstract in English
The era of Big Data is leading the generation of large amounts of data, which require storage and analysis capabilities that can be only addressed by distributed computing systems. To facilitate…
Hadoop Mapreduce Based Distributed Phylogenetic Analysis
Phylogenetic analysis is most important in scientific research of evolution of life, it is a measure of footprints between organisms and analysis requires multiple sequence alignment as input. Even…
Large Scale, Complex Processing of Health Data with MapReduce
TLDR
The article describes a solution to process large volumes of unstructured health social media data in a scalable fashion using the MapReduce framework and achieves significant improvement in processing performance by dividing the processing across a cluster of processors.

References

SHOWING 1-10 OF 25 REFERENCES
Squid – a simple bioinformatics grid
TLDR
Results show that a Squid application, working with N nodes and proper network resources, can process BLAST queries almost N times faster than if working with only one computer.
GridBLAST: a Globus-based high-throughput implementation of BLAST in a Grid computing framework: Research Articles
TLDR
Results presented here show that for large problem sizes, a distributed, Grid-enabled version of a bioinformatics application, BLAST, using Globus as the Grid middleware can help in significantly reducing execution times.
ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis
TLDR
ScalaBLAST is developed, which accommodates very large databases and which scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures, representing a substantial improvement over the current state-of-the-art in high-performance sequence alignment with scaling and portability.
ABCGrid: Application for Bioinformatics Computing Grid
TLDR
A mechanism to install and update all applications and databases in worker nodes automatically to reduce the workload of manual maintenance is implemented and a backup task method and self-adaptive job dispatch approach are used to improve performance.
The design, implementation, and evaluation of mpiBLAST
TLDR
This work presents the software architecture of mpiBLAST, an open-source parallelization of BLAST that achieves superlinear speed-up by segmenting a BLAST database and then having each node in a computational cluster search a unique portion of the database.
Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST)
TLDR
This paper describes a software application, termed Windows .NET Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST), which enhances the BLAST utility by improving usability, fault recovery, and scalability in a Windows desktop environment.
PLINK: a tool set for whole-genome association and population-based linkage analyses.
TLDR
This work introduces PLINK, an open-source C/C++ WGAS tool set, and describes the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation, which focuses on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies.
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles
TLDR
It is demonstrated how the GSEA method yields insights into several cancer-related data sets, including leukemia and lung cancer, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer.
Soap-HT-BLAST: high throughput BLAST based on Web services
SUMMARY A high throughput Basic Local Alignment Search Tool (BLAST) system based on Web services is implemented. It provides an alternative BLAST service and allows users to perform multiple BLAST…
Basic local alignment search tool.
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP)…