Achieving 100,000,000 database inserts per second using Accumulo and D4M

@article{Kepner2014Achieving1D,
  title={Achieving 100,000,000 database inserts per second using Accumulo and D4M},
  author={Jeremy Kepner and William Arcand and David Bestor and Bill Bergeron and Chansup Byun and Vijay N. Gadepally and Matthew Hubbell and Peter Michaleas and Julie Mullen and Andrew Prout and Albert Reuther and Antonio Rosa and Charles Yee},
  journal={2014 IEEE High Performance Extreme Computing Conference (HPEC)},
  year={2014},
  pages={1-6}
}
  • J. Kepner, W. Arcand, Charles Yee
  • Published 18 June 2014
  • Computer Science
  • 2014 IEEE High Performance Extreme Computing Conference (HPEC)
The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a 216-node cluster running the MIT SuperCloud software stack. A peak performance of over 100,000,000 database inserts per second was achieved, which is 100× larger than the highest previously published value for any other database. The performance scales linearly with the…
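The key method amounts to a weak-scaling ingest benchmark: many independent D4M client processes each stream batches of key-value entries into the database, and the reported rate is total entries divided by wall-clock time. Below is a minimal Python sketch of that measurement pattern; the in-memory store, batch size, and client counts are illustrative assumptions, not the paper's actual D4M/Accumulo code.

```python
# Minimal sketch of the weak-scaling ingest-benchmark pattern described in the
# abstract: each client inserts a fixed number of entries in batches, and the
# aggregate rate is total entries / wall-clock time. A plain dict stands in
# for an Accumulo tablet server; all names and sizes here are hypothetical.
import time
from multiprocessing import Pool

ENTRIES_PER_CLIENT = 100_000   # hypothetical per-client workload
BATCH_SIZE = 1_000             # hypothetical batch size

def client_ingest(client_id: int) -> int:
    """Insert ENTRIES_PER_CLIENT (row, col, val) entries in batches."""
    store = {}                  # stand-in for a remote tablet server
    batch = []
    for i in range(ENTRIES_PER_CLIENT):
        batch.append((f"row{client_id:04d}_{i:08d}", "col", "val"))
        if len(batch) == BATCH_SIZE:
            store.update((k, (c, v)) for k, c, v in batch)  # "flush" the batch
            batch.clear()
    if batch:
        store.update((k, (c, v)) for k, c, v in batch)
    return ENTRIES_PER_CLIENT

if __name__ == "__main__":
    for n_clients in (1, 2, 4, 8):   # weak scaling: fixed work per client
        t0 = time.time()
        with Pool(n_clients) as pool:
            total = sum(pool.map(client_ingest, range(n_clients)))
        rate = total / (time.time() - t0)
        print(f"{n_clients:2d} clients: {rate:,.0f} inserts/s")
```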

Citations

A Billion Updates per Second Using 30,000 Hierarchical In-Memory D4M Databases
TLDR
This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets, with a sustained update rate of 1,900,000,000 updates per second.
Lustre, Hadoop, Accumulo
TLDR
Comparisons indicate that Lustre provides 2x more storage capacity, is less likely to lose data during 3 simultaneous drive failures, and provides higher bandwidth on general purpose workloads, while Hadoop can provide 4x greater read bandwidth on special purpose workloads.
Graphulo implementation of server-side sparse matrix multiply in the Accumulo database
TLDR
A server-side implementation of GraphBLAS sparse matrix multiplication that leverages Accumulo's native, high-performance iterators, offered as a core component of the Graphulo library that will deliver matrix math primitives for graph analytics within Accumulo.
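As background for the entry above: GraphBLAS-style sparse matrix multiply accumulates products over matching inner indices, optionally under a user-supplied semiring. The sketch below is purely illustrative (Graphulo's real kernel runs inside Accumulo's server-side iterators, not in Python).

```python
# Illustrative sketch (not Graphulo's code) of the GraphBLAS-style sparse
# multiply C = A (+.x) B. Matrices are dict-of-dicts: A[i][k] holds the
# nonzero at row i, column k; plus/times default to the arithmetic semiring.
def spgemm(A, B, plus=lambda x, y: x + y, times=lambda x, y: x * y):
    C = {}
    for i, row in A.items():
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():   # only matching inner indices
                acc = C.setdefault(i, {})
                acc[j] = plus(acc[j], times(a_ik, b_kj)) if j in acc else times(a_ik, b_kj)
    return C

# Example: squaring an adjacency matrix counts length-2 paths between vertices.
A = {"v1": {"v2": 1}, "v2": {"v3": 1}}
print(spgemm(A, A))   # {'v1': {'v3': 1}}
```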
A database-based distributed computation architecture with Accumulo and D4M: An application of eigensolver for large sparse matrix
TLDR
This paper presents a novel database-based distributed computation architecture bridging the gap between Hadoop and HPC, and is shown to be lighter, easier, and faster than a MapReduce-based approach.
Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
TLDR
This work describes the design and performance optimization of an implementation of hierarchical associative arrays that reduces memory pressure and dramatically increases the update rate into an associative array.
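The hierarchical associative array idea summarized above can be pictured as a cascade of maps: updates land in a small, fast level that is merged into the next, larger level whenever it fills, keeping the hot level small and updates cheap. The following sketch is an assumed structure for illustration, not the paper's implementation; the class name and level cutoffs are hypothetical.

```python
# Hypothetical sketch of a hierarchical associative array for streaming
# updates: level 0 is small and fast; when a level reaches its cutoff it is
# merged (summed) into the next, larger level, reducing memory pressure.
from collections import Counter

class HierarchicalCounter:
    def __init__(self, cutoffs=(1_000, 100_000)):   # illustrative level sizes
        self.cutoffs = cutoffs
        self.levels = [Counter() for _ in range(len(cutoffs) + 1)]

    def update(self, key, value=1):
        self.levels[0][key] += value
        for lvl, cutoff in enumerate(self.cutoffs):
            if len(self.levels[lvl]) >= cutoff:     # level full: merge upward
                self.levels[lvl + 1].update(self.levels[lvl])
                self.levels[lvl].clear()

    def total(self):
        merged = Counter()
        for level in self.levels:
            merged.update(level)
        return merged

h = HierarchicalCounter(cutoffs=(2,))
for k in ["a", "b", "c", "a"]:
    h.update(k)
print(h.total())   # Counter({'a': 2, 'b': 1, 'c': 1})
```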
From NoSQL Accumulo to NewSQL Graphulo: Design and utility of graph algorithms inside a BigTable database
TLDR
This article shows how it is possible to implement the GraphBLAS kernels in a BigTable database by presenting the design of Graphulo, a library for executing graph algorithms inside the Apache Accumulo database, and details the Graphulo implementation of two graph algorithms.
Assessment of Multiple Ingest Strategies for Accumulo Key-Value Store
TLDR
This thesis gives an overview of the aforementioned NoSQL systems and delves into a more specific instance of them, the Accumulo key-value store, which is not designed with an ingest interface for users.
Enabling on-demand database computing with MIT SuperCloud database management system
TLDR
The MIT SuperCloud database management system allows for rapid creation and flexible execution of a variety of the latest scientific databases, including Apache Accumulo and SciDB, and permits snapshotting of databases to allow researchers to experiment and push the limits of the technology without concerns for data or productivity loss.
Hyperscaling Internet Graph Analysis with D4M on the MIT SuperCloud
TLDR
This work has implemented a representative analytics pipeline in D4M and benchmarked it on 96 hours of Gigabit PCAP data with MIT SuperCloud and achieved speedups of over 20,000.

References

Showing 1-10 of 20 references
Understanding query performance in Accumulo
TLDR
An Apache Accumulo-based big data system designed for a network situational awareness application is studied and its storage schema and data retrieval requirements are analyzed, and the corresponding Accumulo performance bottlenecks are characterized.
Benchmarking Apache Accumulo BigData Distributed Table Store Using Its Continuous Test Suite
TLDR
The benchmark study investigated sustained continuous-mode stress testing and identified optimum configurations for very high-throughput data ingest, sequential and random query operations, and Apache Accumulo's unique cell-level data access security feature.
D4M 2.0 schema: A general purpose high performance schema for the Accumulo database
TLDR
This paper presents the D4M 2.0 Schema, a general purpose schema that can be used to fully index and rapidly query every unique string in a dataset, which has been applied with little or no customization to cyber, bioinformatics, scientific citation, free text, and social media data.
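For readers unfamiliar with the schema this reference describes: the D4M 2.0 schema "explodes" each record so that every column|value pair becomes its own column key, and stores both the table and its transpose so that any unique string is directly indexable as a row key. A toy sketch follows, with dicts standing in for the Accumulo tables; the function and table names are illustrative.

```python
# Sketch of the exploded transpose-pair layout from the D4M 2.0 schema paper.
# Every column|value pair becomes its own column key, and both T and its
# transpose Tt are stored so any unique string is a direct row lookup.
def explode(record_id, record, T, Tt, sep="|"):
    for col, val in record.items():
        exploded = f"{col}{sep}{val}"                  # e.g. "src_ip|10.0.0.1"
        T.setdefault(record_id, {})[exploded] = "1"
        Tt.setdefault(exploded, {})[record_id] = "1"   # transpose table

T, Tt = {}, {}
explode("log-0001", {"src_ip": "10.0.0.1", "dst_port": "443"}, T, Tt)
explode("log-0002", {"src_ip": "10.0.0.1", "dst_port": "80"}, T, Tt)

# "All records with src_ip 10.0.0.1" is a single row lookup in the transpose:
print(list(Tt["src_ip|10.0.0.1"]))   # ['log-0001', 'log-0002']
```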
Driving big data with big compute
TLDR
The LLGrid team has developed and deployed a number of technologies that aim to provide the best of both worlds, including LLGrid MapReduce, which allows the map/reduce parallel programming model to be used quickly and efficiently in any language on any compute cluster.
Dynamic distributed dimensional data model (D4M) database and computation system
  • J. Kepner, W. Arcand, Charles Yee
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
TLDR
D4M (Dynamic Distributed Dimensional Data Model) has been developed to provide a mathematically rich interface to tuple stores (and structured query language “SQL” databases), making it possible to create composable analytics with significantly less effort than traditional approaches.
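The associative-array abstraction behind D4M can be illustrated in a few lines; the real D4M is a MATLAB library, so the Python class and methods below are purely hypothetical stand-ins showing how such arrays compose algebraically.

```python
# Hypothetical stand-in for D4M's associative-array abstraction: arrays map
# (row, column) string pairs to values and compose with element-wise algebra.
class Assoc:
    def __init__(self, triples):
        self.data = {(r, c): v for r, c, v in triples}

    def __add__(self, other):   # element-wise sum of two associative arrays
        keys = set(self.data) | set(other.data)
        return Assoc([(r, c, self.data.get((r, c), 0) + other.data.get((r, c), 0))
                      for r, c in keys])

    def row(self, r):           # all columns (and values) for one row key
        return {c: v for (rr, c), v in self.data.items() if rr == r}

A = Assoc([("alice", "doc1", 1)])
B = Assoc([("alice", "doc1", 2), ("bob", "doc2", 1)])
print((A + B).row("alice"))   # {'doc1': 3}
```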
Bigtable: A Distributed Storage System for Structured Data
TLDR
The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.
Dynamo: amazon's highly available key-value store
TLDR
Dynamo is presented, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience and makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
LLSuperCloud: Sharing HPC systems for diverse rapid prototyping
TLDR
LLSuperCloud reverses the traditional paradigm of attempting to deploy supercomputing capabilities on a cloud and instead deploys cloud capability on a supercomputer, resulting in a system that can handle heterogeneous, massively parallel workloads while also providing high performance elastic computing, virtualization, and databases.
Designing Scalable Synthetic Compact Applications for Benchmarking High Productivity Computing Systems
TLDR
The SSCA benchmarks are envisioned to emerge as complements to current scalable micro-benchmarks and complex real applications to measure high-end productivity and system performance, and are described in sufficient detail to drive novel HPC programming paradigms as well as architecture development and testing.
pMatlab Parallel MATLAB Library
TLDR
The overall design and architecture of the pMatlab implementation are described, and it is shown that users are typically able to go from a serial code to an efficient pMatlab code in about 3 hours while changing less than 1% of their code.