Genetic sequence matching using D4M big data approaches

  • Stephanie Dodson, Darrell O. Ricke, Jeremy Kepner
  • Published 25 July 2014
  • Computer Science
  • 2014 IEEE High Performance Extreme Computing Conference (HPEC)
Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln… 
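
The truncated abstract above does not spell out the matching procedure, so the following is only a minimal sketch of one way an associative-array style comparison could look. It assumes sequences are broken into fixed-length words (k-mers) and that matches are found by counting words shared between reference and sample sequences. Plain Python dictionaries stand in for associative arrays; this is not the D4M API, and every function name and the toy data are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(sequences, k):
    """Map each k-mer to the set of sequence IDs that contain it.

    This plays the role of an associative array keyed by (k-mer, sequence ID).
    """
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for word in kmers(seq, k):
            index[word].add(seq_id)
    return index


def shared_kmer_counts(references, samples, k):
    """Count distinct k-mers shared by each (reference, sample) pair.

    Equivalent to multiplying a reference-by-k-mer 0/1 matrix with the
    transpose of a sample-by-k-mer 0/1 matrix, keeping nonzero entries only.
    """
    ref_index = build_index(references, k)
    counts = defaultdict(int)
    for sample_id, seq in samples.items():
        for word in kmers(seq, k):
            # .get avoids inserting empty entries into the defaultdict
            for ref_id in ref_index.get(word, ()):
                counts[(ref_id, sample_id)] += 1
    return dict(counts)


# Toy data, invented for illustration only.
references = {"ref1": "ACGTACGTGA", "ref2": "TTTTGGGGCC"}
samples = {"read1": "CGTACGTG", "read2": "GGGGCCAA"}
result = shared_kmer_counts(references, samples, k=4)
print(result)
```

With this toy input, `read1` shares five distinct 4-mers with `ref1` and `read2` shares three with `ref2`; pairs with no words in common do not appear in the result, which mirrors how a sparse product stores only nonzero entries.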

A highly parallel next-generation DNA sequencing data analysis pipeline in Hadoop
A highly parallel end-to-end next-generation DNA sequencing data analysis pipeline in Hadoop is developed that will allow large cohort populations to be analyzed in parallel, and can fundamentally change the way DNA sequencing analyses are used by both researchers and clinicians.
Rapid sequence identification of potential pathogens using techniques from sparse linear algebra
D4RAGenS is presented, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M).
Implementing Suffix Array Algorithm Using Apache Big Table Data Implementation
It is demonstrated how a well-known algorithm can be refactored to take advantage of a high-performance distributed datastore, illustrating the advantages of using cloud datastore technology for storing large text sequences and retrieving them.
Big Data Tools, Technologies, and Applications: A Survey
This chapter critically analyzes some of the core applications of big data and their impacts in improving the quality of human life by primarily focusing on healthcare and smart city applications, genome sequence annotation applications, and graph-based applications.
Using a Big Data Database to Identify Pathogens in Protein Data Space
This paper explores using big data database technologies to characterize very large metagenomic DNA sequences in protein space, with the ultimate goal of rapid pathogen identification in patient samples.
Demonstrating the BigDAWG Polystore System for Ocean Metagenomics Analysis
This demonstration will show the BigDAWG system and a number of polystore applications built to help ocean metagenomics researchers handle their heterogeneous Big Data.
Julia implementation of the Dynamic Distributed Dimensional Data Model
This work presents an implementation of D4M in Julia and describes how it enables and facilitates data analysis, and showcases scalable performance in the new Julia version as compared to the original MATLAB implementation.
Raising the Bar on Graph Analytic Performance
Graph Challenge 2017 received 22 submissions by 111 authors from 36 organizations and highlighted graph analytic innovations in hardware, software, algorithms, systems, and visualization that produced many comparable performance measurements that can be used for assessing the current state of the art of graph analysis.
Triangle Counting Performance
These submissions show that their state-of-the-art triangle counting execution time is a strong function of the number of edges in the graph, which improved significantly from 2017 to 2018 and remained comparable from 2018 to 2019.
Design, Generation, and Validation of Extreme Scale Power-Law Graphs
  • J. Kepner, S. Samsi, A. Reuther
  • Computer Science, Mathematics
    2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2018
This paper presents a novel approach that uses Kronecker products to allow the exact computation of graph properties prior to graph generation; the graphs can then be generated quickly in memory on a parallel computer with no interprocessor communication.


Taming Biological Big Data with D4M
MIT Lincoln Laboratory computer scientists demonstrated how a new Laboratory-developed technology, the Dynamic Distributed Dimensional Data Model (D4M), can be used to accelerate DNA sequence comparison, a core operation in bioinformatics.
FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets
FASTQSim enables users to assess the quality of NGS datasets and allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software.
Dynamic distributed dimensional data model (D4M) database and computation system
  • J. Kepner, W. Arcand, Charles Yee
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
D4M (Dynamic Distributed Dimensional Data Model) has been developed to provide a mathematically rich interface to tuple stores (and structured query language “SQL” databases) and it is possible to create composable analytics with significantly less effort than using traditional approaches.
D4M 2.0 schema: A general purpose high performance schema for the Accumulo database
This paper presents the D4M 2.0 Schema, a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset, which has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data.
Basic local alignment search tool.
Identification of common molecular subsequences.
Parallel MATLAB - for Multicore and Multinode Computers
  • J. Kepner
  • Computer Science
    Software, environments, tools
  • 2009
Parallel MATLAB for Multicore and Multinode Computers covers more parallel algorithms and parallel programming models than any other parallel programming book due to the succinctness of MATLAB.
The zebrafish reference genome sequence and its relationship to the human genome
A high-quality sequence assembly of the zebrafish genome is generated, made up of an overlapping set of completely sequenced large-insert clones that were ordered and oriented using a high-resolution high-density meiotic map, providing a clearer understanding of key genomic features such as a unique repeat content, a scarcity of pseudogenes, an enrichment of zebrafish-specific genes on chromosome 4, and chromosomal regions that influence sex determination.
Basic Local Alignment Search Tool (BLAST)
BLAST is a heuristic method to find the highest scoring locally optimal alignments between a query sequence and a database sequence to predict the identity, function, and 3D structure of the query sequence.
The Statistics of Sequence Similarity Scores
  • Biology
  • 2002
To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of