Characterization and Optimization of Memory-Resident MapReduce on HPC Systems

@article{Wang2014CharacterizationAO,
  title={Characterization and Optimization of Memory-Resident MapReduce on HPC Systems},
  author={Yandong Wang and Robin Goldstone and Weikuan Yu and Teng Wang},
  journal={2014 IEEE 28th International Parallel and Distributed Processing Symposium},
  year={2014},
  pages={799-808}
}
MapReduce is a widely accepted framework for addressing big data challenges. Recently, it has also gained broad attention from scientists at the U.S. leadership computing facilities as a promising solution to process gigantic simulation results. However, conventional high-end computing systems are constructed based on the compute-centric paradigm while big data analytics applications prefer a data-centric paradigm such as MapReduce. This work characterizes the performance impact of key… 
Experiences of Converging Big Data Analytics Frameworks with High Performance Computing Systems
TLDR
This paper addresses the question of how to accelerate complex applications that contain both data-intensive and compute-intensive workloads on the Tianhe-2 system by deploying an in-memory file system as data-access middleware, and proposes a shared map-output shuffle strategy and a file-metadata cache layer to alleviate the metadata bottleneck.
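A minimal sketch of the second idea named above, a client-side file-metadata cache: stat() results are memoized with a short TTL so repeated lookups avoid the metadata server. The class name, TTL policy, and use of os.stat are illustrative assumptions, not the paper's actual middleware.

```python
import os
import time

class MetadataCache:
    """Caches stat() results so repeated lookups avoid the metadata server."""
    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self._cache = {}          # path -> (timestamp, os.stat_result)

    def stat(self, path):
        now = time.time()
        hit = self._cache.get(path)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]         # served from cache, no metadata RPC
        result = os.stat(path)    # falls through to the real file system
        self._cache[path] = (now, result)
        return result

cache = MetadataCache()
print(cache.stat("/etc/hosts").st_size)   # first call hits the file system
print(cache.stat("/etc/hosts").st_size)   # second call is served from cache
```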
Can Non-volatile Memory Benefit MapReduce Applications on HPC Clusters?
TLDR
This paper designs a novel MapReduce framework with an NVRAM-assisted map-output spill approach on top of the existing RDMA-enhanced Hadoop MapReduce, so that performance gains in both the map and reduce phases carry through to end applications.
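A rough sketch of what an NVRAM-assisted map-output spill could look like: spill files are written to a persistent-memory mount instead of local disk. The mount path, record format, and helper name are assumptions for illustration; the paper's framework actually extends RDMA-enhanced Hadoop MapReduce.

```python
import os
import pickle
import tempfile

# Hypothetical NVRAM (persistent-memory) mount; the sketch falls back to a
# temporary directory so it runs anywhere.
NVRAM_SPILL_DIR = os.environ.get("NVRAM_SPILL_DIR",
                                 os.path.join(tempfile.gettempdir(), "pmem_spill"))

def spill_map_output(task_id, spill_no, records):
    """Persist sorted (key, value) records of one map-side spill to NVRAM."""
    os.makedirs(NVRAM_SPILL_DIR, exist_ok=True)
    path = os.path.join(NVRAM_SPILL_DIR, f"map_{task_id}_spill_{spill_no}.bin")
    with open(path, "wb") as f:
        pickle.dump(sorted(records), f)       # sort by key before spilling
    return path

print(spill_map_output(task_id=3, spill_no=0, records=[("b", 1), ("a", 2)]))
```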
Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC Systems
TLDR
This work presents an enhanced big data execution framework, HOMR (Hybrid Overlapping in MapReduce), which improves the MapReduce job execution pipeline by maximizing overlap among execution phases, proposes different deployment architectures that use Lustre as the underlying storage, and provides fast shuffle strategies with dynamic adjustments.
Memory-Efficient and Skew-Tolerant MapReduce Over MPI for Supercomputing Systems
Data analytics has become an integral part of large-scale scientific computing. Among various data analytics frameworks, MapReduce has gained the most traction. Although some efforts have been made…
Accelerating big data analytics on HPC clusters using two-level storage
Improving the Memory Efficiency of In-Memory MapReduce Based HPC Systems
TLDR
Write Handle Reusing is proposed to fully utilize memory during shuffle-file writing and reading; a Load Balancing Optimizer is introduced to ensure even distribution of data processing across all worker nodes; and a Memory-Aware Task Scheduler that coordinates concurrency level and memory usage is developed to prevent memory spilling.
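A minimal sketch of a memory-aware task scheduler in the spirit described above: tasks are admitted only while their combined estimated memory footprint fits the node's budget, so spilling is avoided. The budget, per-task estimates, and class interface are illustrative assumptions, not the paper's scheduler.

```python
from collections import deque

class MemoryAwareScheduler:
    """Admits tasks only while their estimated memory fits the node budget."""
    def __init__(self, memory_budget_mb):
        self.budget = memory_budget_mb
        self.in_use = 0
        self.pending = deque()    # (task_name, estimated memory in MB)
        self.running = []

    def submit(self, task_name, est_mem_mb):
        self.pending.append((task_name, est_mem_mb))
        self._admit()

    def finish(self, task_name, est_mem_mb):
        self.running.remove((task_name, est_mem_mb))
        self.in_use -= est_mem_mb
        self._admit()             # freed memory may admit waiting tasks

    def _admit(self):
        # Launch pending tasks as long as they fit in the remaining budget.
        while self.pending and self.in_use + self.pending[0][1] <= self.budget:
            task = self.pending.popleft()
            self.running.append(task)
            self.in_use += task[1]
            print(f"launch {task[0]} ({task[1]} MB), in use {self.in_use} MB")

sched = MemoryAwareScheduler(memory_budget_mb=4096)
for i in range(4):
    sched.submit(f"map_{i}", est_mem_mb=1536)   # only two fit at once
```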
Big data analytics on traditional HPC infrastructure using two-level storage
TLDR
A new two-level storage system is developed by integrating Tachyon, an in-memory file system, with OrangeFS, a parallel file system, to increase aggregate I/O throughput.
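A small sketch of the read path in such a two-level setup: a block is first looked up in an in-memory tier (Tachyon-like) and, on a miss, is read from the parallel file system (OrangeFS-like) and promoted. The dict-based memory tier and plain file reads are stand-ins, not the actual Tachyon or OrangeFS client APIs.

```python
import os
import tempfile

memory_tier = {}                       # block_id -> bytes (stand-in for Tachyon)

def read_block(block_id, pfs_path):
    data = memory_tier.get(block_id)
    if data is not None:
        return data                    # hit: served at memory speed
    with open(pfs_path, "rb") as f:    # miss: stand-in for an OrangeFS read
        data = f.read()
    memory_tier[block_id] = data       # promote so later reads skip the PFS
    return data

# Tiny demo with a temporary file standing in for a block on the parallel FS.
demo = os.path.join(tempfile.gettempdir(), "block_0")
with open(demo, "wb") as f:
    f.write(b"simulation output")
print(read_block("block_0", demo))     # read from the "PFS", then cached
print(read_block("block_0", demo))     # second read hits the memory tier
```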
On the Performance of Spark on HPC Systems: Towards a Complete Picture
TLDR
An experimental campaign is conducted to provide a clearer understanding of the performance of Spark, the de facto in-memory data processing framework, on HPC systems, evaluating how latency, contention, and file-system configuration influence application performance.
Horme: Random Access Big Data Analytics
TLDR
This work proposes a solution that builds on MapReduce for use on an HPC system: it preserves the key-value semantics of MapReduce while supporting random query access for subsetting big data datasets, and at the same time hosts the service on the storage medium found in HPC architectures (parallel file systems) for reduced latency.
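A small sketch of the random-access idea: a key-to-byte-offset index built over a key-value file on the parallel file system lets a query seek directly to the records it needs instead of scanning the whole dataset. The tab-separated file format and helper names are assumptions for illustration, not Horme's actual design.

```python
import os
import tempfile

def build_index(path):
    """Scan the key-value file once and record the byte offset of each record."""
    index = {}
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            key = line.split(b"\t", 1)[0].decode()
            index[key] = offset
            offset += len(line)
    return index

def lookup(path, index, key):
    """Random access: seek straight to the record instead of scanning the file."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return f.readline().decode().rstrip("\r\n")

# Tiny demo with a temporary tab-separated file standing in for PFS-resident data.
demo = os.path.join(tempfile.gettempdir(), "kv_data.tsv")
with open(demo, "w") as f:
    f.write("alpha\t1\nbeta\t2\ngamma\t3\n")
idx = build_index(demo)
print(lookup(demo, idx, "beta"))        # -> "beta	2"
```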
Approaches of enhancing interoperations among high performance computing and big data analytics via augmentation
TLDR
This paper sheds light on how big data frameworks can be ported to HPC platforms as a preliminary step towards the convergence of the big data and exascale computing ecosystems.

References

Showing 1-10 of 27 references
Improving MapReduce Performance in Heterogeneous Environments
TLDR
A new scheduling algorithm, Longest Approximate Time to End (LATE), is proposed; it is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
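A minimal sketch of LATE's core heuristic: estimate each running task's remaining time as (1 - progress) / progress_rate and speculatively re-execute the task with the longest estimate. The thresholds and speculative-execution caps that the full algorithm uses are omitted, and the task tuples are illustrative.

```python
def time_to_end(progress, elapsed_s):
    """Estimated remaining time, assuming the observed progress rate continues."""
    progress_rate = progress / elapsed_s          # fraction completed per second
    return (1.0 - progress) / progress_rate

running = [                                       # (task_id, progress, elapsed seconds)
    ("map_0", 0.90, 100.0),
    ("map_1", 0.30, 100.0),                       # straggler on a slow node
    ("map_2", 0.85, 100.0),
]

straggler = max(running, key=lambda t: time_to_end(t[1], t[2]))
print("speculate on", straggler[0])               # -> map_1
```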
A throughput optimal algorithm for map task scheduling in mapreduce with data locality
TLDR
A new queueing architecture is presented, and a map-task scheduling algorithm combining the Join-the-Shortest-Queue policy with the MaxWeight policy is proposed; the algorithm is throughput optimal, and its outer bound coincides with the actual capacity region.
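A loose, simplified illustration of the two ingredients named above, not the paper's exact queueing architecture: an arriving map task joins the shortest queue among the servers that hold its data (Join the Shortest Queue), and a free server serves the queue with the largest rate-weighted backlog (MaxWeight). The per-server local/remote queues and service rates are assumptions.

```python
from collections import deque

SERVERS = ("server0", "server1", "server2")
# Each server keeps one queue of data-local tasks and one of remote tasks.
queues = {s: {"local": deque(), "remote": deque()} for s in SERVERS}
SERVICE_RATE = {"local": 1.0, "remote": 0.5}     # assumed: remote tasks run slower

def dispatch(task_id, data_servers):
    """Join the Shortest Queue: enqueue at the least-loaded server holding the data."""
    target = min(data_servers, key=lambda s: len(queues[s]["local"]))
    queues[target]["local"].append(task_id)

def serve(server):
    """MaxWeight: serve the queue with the largest rate-weighted backlog."""
    best = max(("local", "remote"),
               key=lambda q: SERVICE_RATE[q] * len(queues[server][q]))
    return queues[server][best].popleft() if queues[server][best] else None

dispatch("map_0", data_servers=("server0", "server1"))
dispatch("map_1", data_servers=("server0", "server1"))   # goes to server1 (shorter)
print(serve("server0"), serve("server1"))                 # -> map_0 map_1
```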
Accelerating and Simplifying Apache™ Hadoop® with Panasas® ActiveStor®
Computer Science, 2013
TLDR
This paper will show that analytics performance can actually be enhanced with Panasas ActiveStor storage; this is particularly relevant for institutions that have already invested in compute clusters for other big data workloads, as they can now run Hadoop on their existing compute infrastructure in conjunction with their Panasas ActiveStor enterprise-class storage system.
Cloud Analytics: Do We Really Need to Reinvent the Storage Stack?
TLDR
This paper revisits the debate on the need for a new non-POSIX storage stack for cloud analytics and argues, based on an initial evaluation, that cloud analytics can be built on traditional POSIX-based cluster file systems.
MapReduce: Simplified Data Processing on Large Clusters
TLDR
This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
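The canonical word-count illustration of the MapReduce programming model: a map function emits (key, value) pairs, and a reduce function aggregates all values for a key. This is a single-process sketch of the model only; the cluster runtime, partitioning, and fault tolerance of the actual system are not represented.

```python
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word, 1                      # emit (key, value) pairs

def reduce_fn(word, counts):
    return word, sum(counts)               # aggregate all values for one key

def run_mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)      # group by key (the "shuffle")
    return dict(reduce_fn(k, v) for k, v in groups.items())   # reduce phase

print(run_mapreduce(["big data on hpc", "big memory big data"]))
# {'big': 3, 'data': 2, 'on': 1, 'hpc': 1, 'memory': 1}
```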
Evaluation of HPC Applications on Cloud
TLDR
The results show that the cloud is a viable platform for some applications: specifically, non-communication-intensive applications such as embarrassingly parallel and tree-structured computations up to high processor counts, and communication-intensive applications up to low processor counts.
Shark: SQL and rich analytics at scale
TLDR
Shark is a new data analysis system that marries query processing with complex analytics on large clusters and extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL.
CooMR: Cross-task coordination for efficient data management in MapReduce programs
TLDR
A cross-task coordination framework called CooMR is designed for efficient data management in MapReduce programs; it is able to increase task coordination, improve system resource utilization, and significantly speed up the execution of MapReduce programs.
Hadoop acceleration through network levitated merge
TLDR
Hadoop-A is described: an acceleration framework that optimizes Hadoop with plugin components implemented in C++ for fast data movement, overcoming Hadoop's existing limitations and including a novel network-levitated merge algorithm.
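A rough sketch of the idea behind a network-levitated merge: reduce-side input is merged directly from sorted remote map-output streams in a streaming k-way merge, without materializing intermediate data on local disk. Here each "remote segment" is a generator over an in-memory list standing in for a fetched stream; the real plugin components are implemented in C++.

```python
import heapq

def remote_segment(sorted_pairs, fetch_size=2):
    """Generator standing in for fetching a sorted map-output segment in chunks."""
    for i in range(0, len(sorted_pairs), fetch_size):
        for pair in sorted_pairs[i:i + fetch_size]:   # only a small window in memory
            yield pair

segments = [
    remote_segment([("a", 1), ("c", 3), ("e", 5)]),
    remote_segment([("b", 2), ("c", 4), ("d", 6)]),
]
for key, value in heapq.merge(*segments):             # streaming k-way merge
    print(key, value)
```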
A comparative study of high-performance computing on the cloud
TLDR
This paper compares the top-of-the-line EC2 cluster to HPC clusters at Lawrence Livermore National Laboratory (LLNL) based on turnaround time and total cost of execution, and observes that the cost-effectiveness of running an application on a cluster depends on raw performance and application scalability.