ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

@article{Oehmen2006ScalaBLASTAS,
  title={ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis},
  author={Christopher S. Oehmen and Jarek Nieplocha},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  year={2006},
  volume={17},
  pages={740-749}
}
  • C. Oehmen, J. Nieplocha
  • Published 1 August 2006
  • Computer Science
  • IEEE Transactions on Parallel and Distributed Systems
Genes in an organism's DNA (genome) have embedded in them information about proteins, which are the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes. Mammalian genomes can contain tens of thousands of genes. For each genome sequenced, the challenge is to identify protein components (proteome) being actively used for a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein…
High-throughput computation of pairwise sequence similarities for multiple genome comparisons using ScalaBLAST
TLDR
This work presents an example of how ScalaBLAST, a high-throughput sequence analysis program, harnesses increasingly critical high-performance computing to perform sequence analysis, enabling, for example, all vs. all BLAST runs across 2 million protein sequences within a day using thousands of processors as opposed to conventional comparison methods that would take years to complete.
Parallel genomic sequence-search on a massively parallel system
TLDR
Modifications and extensions to a parallel, distributed-memory version of BLAST called mpiBLAST-PIO are described, along with how it maps to a massively parallel system, specifically IBM Blue Gene/L (BG/L).
A work stealing based approach for enabling scalable optimal sequence homology detection
TLDR
The design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers are presented, and a combination of techniques from asynchronous load balancing, work stealing, dynamic task counters, data replication, and exact-matching filters is used to achieve homology detection at scale.
Massively parallel genomic sequence search on the Blue Gene/P architecture
TLDR
This paper presents first experiences in mapping and optimizing genomic sequence search onto the massively parallel IBM Blue Gene/P (BG/P) platform, and demonstrates that such scalability enables a large-scale bioinformatics problem to be completed in only a few hours on BG/P.
A performance analysis of genome search by matching whole targeted reads on different environments
TLDR
This study reviews this approach of identifying genes and compares the performance of different system environments, and proposes an approach referred to as the genome search system, which reduces the use of hardware resources to process whole assembled reads.
pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs
TLDR
The method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation.
Towards scalable optimal sequence homology detection
TLDR
This paper presents a scalable framework to conduct large-scale optimal homology detection on massively parallel super-computing platforms and employs distributed memory work stealing to effectively parallelize optimal pairwise alignment computation tasks.
Accelerating Protein Sequence Search in a Heterogeneous Computing System
TLDR
An implementation of the BLAST algorithm for searching protein sequences in a heterogeneous computing system that delivers a seven-fold speedup over the sequential BLASTP for the most computationally intensive phase on an NVIDIA Fermi C2050 GPU.
Eliminating Irregularities of Protein Sequence Search on Multicore Architectures
TLDR
This paper designs and develops a database-indexed BLAST with the same sensitivity as query-indexed NCBI-BLAST, and proposes muBLASTP, which uses multiple optimizations to improve data locality and parallel efficiency for multicore architectures and multi-node systems.
An adaptive multi-policy grid service for biological sequence comparison
TLDR
An adaptive task allocation framework to perform BLAST searches in a grid environment, providing an infrastructure that executes distributed BLAST genomic database comparisons and a mechanism to compute grid nodes' execution weight, adapting the chosen allocation policy to the observed computational power and local load of the nodes.

References

SHOWING 1-10 OF 35 REFERENCES
Identifying Candidate Disease Genes with High-Performance Computing
TLDR
A system to acquire and mine data from a subset of databases containing biological information that may contain data relevant to the identification of disease-causing genes is developed to aid the efforts to identify disease genes.
Parallelization of local BLAST service on workstation clusters
TLDR
The current implementation, which parallelizes batch requests, is described, and the plans for implementation of the other levels are also described; these will ultimately be applied to hardware assistance for this soon-to-be primitive computer operation.
SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters
TLDR
QS-search and DS-BLAST provide a flexible solution to adapt sequential similarity searching applications in high performance computing environments, and their ability to wrap a variety of database search programs provides an analytical architecture to assist both the seasoned bioinformaticist and the wet-bench biologist.
Large scale hierarchical clustering of protein sequences
TLDR
This work clusters all known protein sequences hierarchically into superfamily and family clusters using graph-based algorithms that take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning.
ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches.
  • T. Rognes
  • Biology, Medicine
  • Nucleic acids research
  • 2001
TLDR
The sensitivity and specificity of ParAlign were found to be as good as those of Smith-Waterman implementations when the same method for computing the statistical significance of the matches was used, and only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach in terms of speed.
Efficient data access for parallel BLAST
TLDR
This paper presents a set of techniques for efficient and flexible data handling in parallel sequence search applications, and shows that these techniques can bring an order-of-magnitude improvement to both the overall performance and scalability of mpiBLAST.
Bio-sequence analysis with cradle's 3SoC™ software scalable system on chip
TLDR
A preliminary implementation of the Smith-Waterman algorithm using a new chip multiprocessor architecture with multiple Digital Signal Processors on a single chip, leading to high performance at low cost.
A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins
A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed. From these findings it is possible to determine whether significant homology…
Piers: an efficient model for similarity search in DNA sequence databases
TLDR
It is shown theoretically and empirically that the proposed approach can efficiently detect biological sequences that are similar to a query sequence with very high sensitivity.
DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors
TLDR
By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved and it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.