Driving big data with big compute

@inproceedings{Byun2012DrivingBD,
  title={Driving big data with big compute},
  author={Chansup Byun and William Arcand and David Bestor and Bill Bergeron and Matthew Hubbell and Jeremy Kepner and Andrew McCabe and Peter Michaleas and Julie Mullen and David O'Gwynn and Andrew Prout and A. Reuther and Antonio Rosa and Charles Yee},
  booktitle={2012 IEEE Conference on High Performance Extreme Computing},
  year={2012},
  pages={1-6}
}
  • Published 1 September 2012
Big Data (as embodied by Hadoop clusters) and Big Compute (as embodied by MPI clusters) provide unique capabilities for storing and processing large volumes of data. Hadoop clusters make distributed computing readily accessible to the Java community and MPI clusters provide high parallel efficiency for compute intensive workloads. Bringing the big data and big compute communities together is an active area of research. The LLGrid team has developed and deployed a number of technologies that aim… 
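The contrast between the two models the abstract draws can be illustrated in plain Python (a hedged sketch only; no Hadoop or MPI machinery is involved, and `mapper`/`reducer`/`spmd_rank` are illustrative names):

```python
from functools import reduce
from collections import Counter

# Map-reduce style (the Hadoop model): the framework applies `mapper`
# to each record and folds the partial results with `reducer`.
def mapper(line):
    return Counter(line.split())

def reducer(a, b):
    return a + b

lines = ["big data big compute", "big data"]
word_counts = reduce(reducer, map(mapper, lines))

# SPMD style (the MPI model): every "rank" runs the same program on its
# own slice of the data, then partial results are combined explicitly.
def spmd_rank(rank, nranks, data):
    my_slice = data[rank::nranks]          # cyclic partition across ranks
    return reduce(reducer, map(mapper, my_slice), Counter())

partials = [spmd_rank(r, 2, lines) for r in range(2)]
combined = reduce(reducer, partials)
assert combined == word_counts  # both models compute the same answer
```

The map-reduce version leaves scheduling to the framework; the SPMD version makes the data partition and the final combine explicit, which is where its parallel efficiency comes from.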
LLMapReduce: Multi-level map-reduce for high performance data analysis
TLDR
LLMapReduce dramatically simplifies map-reduce programming by providing simple parallel programming capability in one line of code, and can overcome scaling limits in the map-reduce parallel programming model via options that let the user switch to the more efficient single-program-multiple-data (SPMD) parallel programming model.
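LLMapReduce itself targets the LLGrid scheduler, but the "one line of code" idea can be sketched with Python's standard library (`llmap` is a hypothetical helper name, not LLMapReduce's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def llmap(fn, inputs, workers=4):
    # One-line-style parallel map: apply `fn` to every input on a pool
    # of workers, mirroring LLMapReduce's default map-only mode.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, inputs))

# Usage: squaring stands in for a per-file analysis program.
results = llmap(lambda x: x * x, range(8))
```

The user writes one call; partitioning the inputs across workers and collecting results in order is handled by the helper, which is the ergonomic point the TLDR makes.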
Dual direction big data download and analysis
TLDR
A new approach to enhance the overall big data analysis performance by simultaneously parallelizing the download of the data from multiple replicated sites to multiple compute nodes that will also perform the analysis in parallel.
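The download-then-analyze overlap described above can be sketched as follows (a minimal sketch: `SITES`, `download`, and `analyze` are stand-ins, and real code would stream bytes from replicated servers rather than fabricate chunks):

```python
from concurrent.futures import ThreadPoolExecutor

SITES = ["siteA", "siteB"]  # hypothetical replicated data sources

def download(chunk_id):
    # Stand-in for fetching chunk `chunk_id` from one of the replicas;
    # replicas are chosen round-robin to spread transfer load.
    site = SITES[chunk_id % len(SITES)]
    return (site, list(range(chunk_id * 3, chunk_id * 3 + 3)))

def analyze(payload):
    site, data = payload
    return sum(data)  # trivial per-chunk analysis

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each worker downloads its own chunk and analyzes it in place, so
    # transfer and analysis proceed in parallel across compute nodes.
    totals = list(pool.map(lambda c: analyze(download(c)), range(4)))

grand_total = sum(totals)
```

Because every node both fetches and analyzes its own chunk, no single download link or staging step serializes the pipeline.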
Týr: Storage-Based HPC and Big Data Convergence Using Transactional Blobs
TLDR
This thesis proposes the key design principles of Týr, a converging storage system designed to answer the needs of both HPC and BDA applications, natively offering data access coordination in the form of transactions, and demonstrates the relevance and efficiency of its design in the light of convergence in multiple applicative contexts from both communities.
Tweet Analysis: Twitter Data processing Using Apache Hadoop
TLDR
A way of analyzing big data such as Twitter data using Apache Hadoop, which processes and analyzes the tweets on a Hadoop cluster, is provided; it also includes visualizing the results as pictorial representations of Twitter users and their tweets.
Taming Biological Big Data with D4M
TLDR
MIT Lincoln Laboratory computer scientists demonstrated how a new Laboratory-developed technology, the Dynamic Distributed Dimensional Data Model (D4M), can be used to accelerate DNA sequence comparison, a core operation in bioinformatics.
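D4M's core abstraction is the associative array, on which sequence comparison becomes a sparse matrix product. A toy version of that idea, assuming dict-of-dicts stands in for D4M's sparse associative arrays and a trivial k-mer decomposition:

```python
from collections import defaultdict

def kmers(seq, k=4):
    # Decompose a DNA sequence into its set of length-k substrings.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def assoc(seqs, k=4):
    # Associative-array view: rows are sequence IDs, columns are k-mers,
    # entries are 1 if the sequence contains that k-mer (cf. D4M).
    A = defaultdict(dict)
    for sid, seq in seqs.items():
        for km in kmers(seq, k):
            A[sid][km] = 1
    return A

def correlate(A):
    # Sparse A * A': entry (i, j) counts k-mers shared by sequences i, j,
    # which is the core comparison operation accelerated by D4M.
    out = {}
    for i, row_i in A.items():
        for j, row_j in A.items():
            out[(i, j)] = len(row_i.keys() & row_j.keys())
    return out

seqs = {"s1": "ACGTACGT", "s2": "ACGTTTTT"}
C = correlate(assoc(seqs))
```

The off-diagonal entries of `C` give pairwise sequence similarity; D4M performs the same computation at scale against a tuple store rather than in-memory dicts.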
LLSuperCloud: Sharing HPC systems for diverse rapid prototyping
TLDR
LLSuperCloud reverses the traditional paradigm of attempting to deploy supercomputing capabilities on a cloud and instead deploys cloud capability on a supercomputer, resulting in a system that can handle heterogeneous, massively parallel workloads while also providing high performance elastic computing, virtualization, and databases.
Achieving 100,000,000 database inserts per second using Accumulo and D4M
TLDR
The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications and has a peak performance of over 100,000,000 database inserts per second which is 100× larger than the highest previously published value for any other database.
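Insert rates like that depend on buffering mutations and writing them in bulk rather than one RPC per entry. A sketch of that batching pattern against an in-memory stand-in for the store (`BatchWriter` here is an illustrative class, not Accumulo's Java client):

```python
class BatchWriter:
    # Buffer (row, column, value) triples and flush them to the store
    # in bulk, instead of paying one round trip per insert.
    def __init__(self, store, batch_size=1000):
        self.store, self.batch_size, self.buf = store, batch_size, []

    def put(self, row, col, val):
        self.buf.append((row, col, val))
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        for row, col, val in self.buf:
            self.store.setdefault(row, {})[col] = val
        self.buf.clear()

store = {}
w = BatchWriter(store, batch_size=100)
for i in range(250):
    w.put(f"row{i:03d}", "count", i)
w.flush()  # drain the final partial batch
```

Spreading many such writers across D4M client processes is what drives the aggregate rate up; the relaxed-consistency model means no global coordination throttles them.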
Lustre, Hadoop, Accumulo
TLDR
Comparisons indicate that Lustre provides 2x more storage capacity, is less likely to lose data during 3 simultaneous drive failures, and provides higher bandwidth on general-purpose workloads than Hadoop, which can provide 4x greater read bandwidth on special-purpose workloads.
Portable Map-Reduce Utility for MIT SuperCloud Environment
TLDR
An option to consolidate multiple input data files per compute task into a single stream of input, with minimal changes to the target application, enables users to reduce the computational overhead associated with multiple application start-ups when dealing with more than one piece of input data per compute task.
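The consolidation option amounts to presenting many input files as one logical stream so the target application starts once per task rather than once per file. A minimal sketch (the `opener` hook is an assumption used here so the example can run against in-memory files):

```python
import io

def consolidated_stream(paths, opener=open):
    # Yield lines from many input files as one logical stream; the
    # consuming application sees a single input and starts only once.
    for p in paths:
        with opener(p) as f:
            yield from f

# Usage: in-memory buffers stand in for input shards on disk.
fake_files = {"a.txt": "1\n2\n", "b.txt": "3\n"}
lines = list(consolidated_stream(
    fake_files,
    opener=lambda p: io.StringIO(fake_files[p]),
))
```

With `n` files per task, this turns `n` process launches into one, which is exactly the start-up overhead the TLDR says the option removes.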

References

SHOWING 1-10 OF 14 REFERENCES
Dynamic distributed dimensional data model (D4M) database and computation system
  • J. Kepner, W. Arcand, Charles Yee
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
TLDR
D4M (Dynamic Distributed Dimensional Data Model) has been developed to provide a mathematically rich interface to tuple stores (and structured query language “SQL” databases) and it is possible to create composable analytics with significantly less effort than using traditional approaches.
pMATLAB Parallel MATLAB Library
TLDR
The overall design and architecture of the pMatlab implementation is described, and it is shown that users are typically able to go from a serial code to an efficient pMatlab code in about 3 hours while changing less than 1% of their code.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
TLDR
The results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
YCSB++: benchmarking and performance debugging advanced features in scalable table stores
TLDR
YCSB++ is described, a set of extensions to the Yahoo! Cloud Serving Benchmark that includes multi-tester coordination for increased load and eventual consistency measurement, multi-phase workloads to quantify the consequences of work deferment and the benefits of anticipatory configuration optimization, and abstract APIs for explicit incorporation of advanced features in benchmark tests.
pMATLAB: Parallel MATLAB Library for Signal Processing Applications
TLDR
The pMatlab design and results for the HPC Challenge benchmark suite are presented; pMatlab simplifies parallel programming by removing the need for explicit message passing.
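pMatlab hides message passing behind "maps" that describe how a global array is split across processors. The partitioning idea can be sketched in Python (a sketch only; `local_range` is an illustrative name, and real pMatlab maps also cover higher dimensions and block-cyclic layouts):

```python
def local_range(n, rank, nranks):
    # pMatlab-style map: split indices 0..n-1 evenly across ranks;
    # rank r owns the half-open interval [lo, hi).
    return rank * n // nranks, (rank + 1) * n // nranks

N, NRANKS = 8, 4
global_in = list(range(N))
parts = []
for rank in range(NRANKS):
    lo, hi = local_range(N, rank, NRANKS)
    parts.append([2 * x for x in global_in[lo:hi]])  # purely local compute
global_out = [x for part in parts for x in part]     # gather
```

Each rank touches only its own slice, so the user writes ordinary array code while the map determines ownership; no explicit send/receive appears anywhere.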
MatlabMPI
Parallel MATLAB - for Multicore and Multinode Computers
  • J. Kepner
  • Computer Science
    Software, environments, tools
  • 2009
TLDR
Parallel MATLAB for Multicore and Multinode Computers covers more parallel algorithms and parallel programming models than any other parallel programming book due to the succinctness of MATLAB.
Interactive Grid Computing at Lincoln Laboratory
TLDR
This paper aims to provide a well-defined and repeatable process for migrating from serial to parallel code, and to provide a simple mechanism for using lower-level communication alongside pMatlab constructs when necessary.
Apache Accumulo
Apache HBase http://hbase.apache.org