Bridging the Gap between HPC and Big Data frameworks

@article{Anderson2017BridgingTG,
  title={Bridging the Gap between HPC and Big Data frameworks},
  author={Michael J. Anderson and Shaden Smith and Narayanan Sundaram and Mihai Capotă and Zheguang Zhao and Subramanya R. Dulloor and Nadathur Satish and Theodore L. Willke},
  journal={Proc. VLDB Endow.},
  year={2017},
  volume={10},
  pages={901--912}
}
Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower compared to native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this…

Citations
Spark-DIY: A Framework for Interoperable Spark Operations with High Performance Block-Based Data Models
This paper proposes an architecture to support the integration of highly scalable MPI block-based data models and communication patterns with a map-reduce-based programming model. It preserves the data abstraction and programming interface of Spark, but allows the user to delegate operations to the MPI layer.

Approaches of enhancing interoperations among high performance computing and big data analytics via augmentation
This paper sheds light on how big data frameworks can be ported to HPC platforms as a preliminary step towards the convergence of the big data and exascale computing ecosystems.

Is Intel high performance analytics toolkit a good alternative to Apache Spark?
This paper compares the performance and stability of two Big Data processing tools, Apache Spark and the High Performance Analytics Toolkit (HPAT), and concludes that HPAT offers performance improvements relative to Apache Spark.

Accelerating Spark-Based Applications with MPI and OpenACC
A Hybrid Spark MPI OpenACC (HSMO) system is proposed that integrates Spark as a big data programming model with MPI and OpenACC as parallel programming models, together with a mapping technique built on the application's virtual topology as well as the physical topology of the underlying resources.

Alchemist: An Apache Spark ⇔ MPI interface
The motivation behind the development of Alchemist is discussed, the efficiency of the approach on medium-to-large data sets is demonstrated, and the performance of Spark alone is compared with that of Spark+Alchemist.

Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY
This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features of interest to both the HPC and BDA communities, and including an abstract data collection and operational model that provides a unified interface for hybrid applications.

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist
Alchemist is introduced: a system designed to call MPI-based libraries from Apache Spark to accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment.

A Proposed Architecture for Parallel HPC-based Resource Management System for Big Data Applications
The objective of this paper is to enhance the performance of big data applications on HPC clusters without sacrificing power consumption, by building a parallel HPC-based Resource Management System that exploits the capabilities of HPC resources efficiently.

A Case Against Tiny Tasks in Iterative Analytics
An alternative approach is proposed that relies on an auto-parallelizing compiler tightly integrated with the MPI runtime, illustrating the opposite end of the spectrum, where task granularities are as large as possible.

Performance Evaluation of Apache Spark Vs MPI: A Practical Case Study on Twitter Sentiment Analysis
Results show that MPI outperforms Apache Spark in parallel and distributed cluster computing environments, and hence the higher performance of MPI can be exploited in big data applications to improve speedups.

References

Showing 1-10 of 43 references
Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf
Experiments with a particle physics data set show that MPI/OpenMP outperforms Spark by more than one order of magnitude in processing speed and provides more consistent performance. However, Spark offers better data management infrastructure and the possibility of handling other aspects such as node failure and data replication.

SWAT: A Programmable, In-Memory, Distributed, High-Performance Computing Platform
This work addresses the CPU and memory performance bottlenecks in Apache Spark by accelerating user-written computational kernels on accelerators. The approach, Spark With Accelerated Tasks (SWAT), is an accelerated data analytics framework that enables programmers to natively execute Spark applications on high-performance hardware platforms with co-processors.

DataMPI: Extending MPI to Hadoop-Like Big Data Computing
This paper abstracts the characteristics of key-value communication patterns into a bipartite communication model, which reveals four distinctions from MPI (Dichotomic, Dynamic, Data-centric, and Diversified features), and proposes the specification of a minimalistic extension to MPI.

Making Sense of Performance in Data Analytics Frameworks
It is found that CPU (and not I/O) is often the bottleneck, that improving network performance can improve job completion time by a median of at most 2%, and that the causes of most stragglers can be identified.

Matrix Computations and Optimization in Apache Spark
A comprehensive set of benchmarks for hardware-accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM.

A Framework for Elastic Execution of Existing MPI Programs
This paper describes initial work towards the goal of making existing MPI applications elastic for a cloud framework, demonstrates the feasibility of the approach, and shows that outputting, redistributing, and reading back data can be a reasonable way to make existing MPI applications elastic.

Thrill: High-performance algorithmic distributed batch data processing with C++
The design and a first performance evaluation of Thrill are presented: a prototype of a general-purpose big data processing framework with a convenient data-flow-style programming interface based on C++, which enables performance advantages through direct native code compilation, a more cache-friendly memory layout, and explicit memory management.

Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
This work explores the trade-offs of performing linear algebra using Apache Spark compared to traditional C and MPI implementations on HPC platforms, examining three widely used and important matrix factorizations: NMF, PCA, and CX.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Resilient Distributed Datasets (RDDs) are presented: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
This work analyzes the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and Apache-Hadoop paradigms, and proposes a basis, common terminology, and functional factors upon which to analyze the two approaches.