Corpus ID: 235755316

Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark

  title={Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark},
  author={Ylenia Galluzzo and Raffaele Giancarlo and M{\'a}rio Luciano Randazzo and Simona E. Rombo},
With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of “omics” data are daily collected and need to be processed. Indexing and compressing large sequences datasets are some of the most important tasks in this context. Here we propose algorithms for the computation of Burrows Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first ones that distribute the index computation and not only the input dataset… Expand

Figures and Tables from this paper


BigBWA: approaching the Burrows-Wheeler aligner to Big Data technologies
BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows-Wheeler aligner (BWA). Expand
A New Burrows Wheeler Transform Markov Distance
Unlike other compression-based distance metrics known to us, the new Burrows Wheeler Markov Distance works by embedding sequences into a fixed-length feature vector, which allows it to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods. Expand
Rapid parallel genome indexing with MapReduce
A novel parallel algorithm for constructing the suffix array and the BWT of a sequence leveraging the unique features of the MapReduce parallel programming model is presented and the end-to-end runtime is reduced from hours to mere minutes. Expand
Parallel distributed memory construction of suffix and longest common prefix arrays
  • P. Flick, S. Aluru
  • Computer Science
  • SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2015
This paper presents parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Expand
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks. Expand
Replacing suffix trees with enhanced suffix arrays
This article shows how every algorithm that uses a suffix tree as data structure can systematically be replaced with an algorithm that use an enhanced suffix array and solves the same problem in the same time complexity. Expand
Indexing Compressed Text
This thesis implements an compressed full-text index, so that compressed texts can be indexed to support fast queries without decompressing the whole texts and shows that the index is compact and supports fast search. Expand
Next-generation sequencing: big data meets high performance computing.
To make effective use of the produced data, the design of big data algorithms and their efficient implementation on modern high performance computing systems is required. Expand
Linear work suffix array construction
A generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space--time tradeoff and is asymptotically faster than all previous suffix tree or array construction algorithms. Expand
MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping
A neural-network based training scheme is provided to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. Expand