Corpus ID: 32653703

H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems

  • Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Marie Gerhardt, Surendra Byna, Michael F. Ringenburg, Prabhat
The Spark framework has been tremendously powerful for performing Big Data analytics in distributed data centers. However, using Spark to analyze large-scale scientific data on HPC systems has several challenges. For instance, parallel file systems are shared among all computing nodes, in contrast to shared-nothing architectures. Additionally, accessing data stored in commonly used scientific data formats, such as HDF5 and netCDF, is not natively supported in Spark. Our study focuses on… 
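The abstract describes bridging Spark to HDF5 data on a shared parallel file system. A minimal sketch of the underlying idea, assuming a hyperslab-per-task layout (the function name and even-split policy here are illustrative, not the actual H5Spark API):

```python
# Hypothetical sketch: how an H5Spark-style reader might split a large
# HDF5 dataset into per-task (offset, count) hyperslabs so that each
# Spark partition issues one contiguous read against the shared file.

def hyperslab_partitions(n_rows, n_partitions):
    """Divide n_rows into n_partitions contiguous (offset, count) slabs."""
    base, extra = divmod(n_rows, n_partitions)
    slabs, offset = [], 0
    for i in range(n_partitions):
        count = base + (1 if i < extra else 0)  # spread remainder evenly
        slabs.append((offset, count))
        offset += count
    return slabs

# Each (offset, count) pair would be handed to one Spark task, which opens
# the shared file on the parallel file system and reads only its slab,
# e.g. with h5py: f["dataset"][offset : offset + count].
print(hyperslab_partitions(10, 3))  # → [(0, 4), (4, 3), (7, 3)]
```

Because every node sees the same file through the parallel file system, no data shuffling is needed at ingest time; each task reads its slab independently.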


Exploiting Spark for HPC Simulation Data: Taming the Ephemeral Data Explosion

This paper addresses the challenge of analyzing simulation data on HPC systems by using Apache Spark, which is a Big Data framework, and investigates the real-world application of scaling machine learning algorithms to predict and analyze failures in multi-physics simulations on 76TB of data.

FITS Data Source for Apache Spark

This work investigates the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys, and shows how to manage, within a distributed environment, the complex binary data structures handled in astrophysics experiments, such as binary tables stored in FITS files.

DynIMS: A Dynamic Memory Controller for In-memory Storage on HPC Systems

A dynamic memory controller, DynIMS, is developed, which infers memory demands of compute tasks online and employs a feedback-based control model to adapt the capacity of in-memory storage.

SDAC: Porting Scientific Data to Spark RDDs

SDAC (Scientific Data Auto Chunk) is introduced for porting various scientific data formats to RDDs, supporting parallel processing and analytics in the Apache Spark framework by integrating an auto-chunk method for specifying task granularity.
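A toy sketch of the chunk-aware idea behind such schemes (names and the splitting heuristic are illustrative, not SDAC's actual code): task boundaries are aligned to the file's chunk boundaries so that no chunk is read by two tasks.

```python
# Hypothetical "auto-chunk" style heuristic: choose task boundaries that
# fall on HDF5 chunk boundaries, so each stored chunk is read exactly once.

def chunk_aligned_splits(n_rows, chunk_rows, n_tasks):
    """Return (start, end) row ranges whose starts are multiples of chunk_rows."""
    n_chunks = -(-n_rows // chunk_rows)     # ceil division: chunks in dataset
    per_task = -(-n_chunks // n_tasks)      # chunks per task, rounded up
    splits = []
    for t in range(0, n_chunks, per_task):
        start = t * chunk_rows
        end = min((t + per_task) * chunk_rows, n_rows)
        splits.append((start, end))
    return splits

# 1000 rows stored in 128-row chunks, split across 4 tasks:
print(chunk_aligned_splits(1000, 128, 4))
# → [(0, 256), (256, 512), (512, 768), (768, 1000)]
```

Aligning reads to chunk boundaries matters because a chunked HDF5 file is compressed and stored per chunk; misaligned partitions force tasks to decompress overlapping chunks redundantly.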

SciDP: Support HPC and Big Data Applications via Integrated Scientific Data Processing

Experimental results show that SciDP accelerates analysis and visualization of a production NASA Center for Climate Simulation (NCCS) climate and weather application by 6x to 8x when compared to existing solutions.

Analyzing astronomical data with Apache Spark

The performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys is investigated, and it is shown how to manage more complex binary data structures, such as those handled in astrophysics experiments, within a distributed environment.

SciAP: A Programmable, High-Performance Platform for Large-Scale Scientific Data

SciAP enables domain scientists to natively execute Spark programs and applications for processing and analyzing scientific data in HPC environments; it uses a model-driven approach to extract abstract models from heterogeneous scientific data formats, ultimately providing a unified interface to access raw scientific data.

Bringing the HPC reconstruction algorithms to Big Data platforms

  • N. Malitsky
  • Computer Science
    2016 New York Scientific Data Summit (NYSDS)
  • 2016
This paper focuses on the Spark-based integration of the SHARP distributed ptychographic solver developed by the team of the Center for Advanced Mathematics for Research Applications (CAMERA) at Berkeley, and presents it as a reference use case that captures the major technical aspects of other beamline applications.

Matrix factorizations at scale: A comparison of scientific data analytics in spark and C+MPI using three case studies

This work explores the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms, and examines three widely-used and important matrix factorizations: NMF, PCA and CX.

Spark and HPC for High Energy Physics Data Analyses

A use case focused on searching for new types of elementary particles that could explain Dark Matter in the universe is presented, along with the benefits and limitations of using Spark with HDF5 on Edison at NERSC.



Scaling Spark on HPC Systems

The results show that file system metadata access latency can dominate in an HPC installation using Lustre: it makes single-node performance up to 4x slower than that of a typical workstation, and scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark.

Tuning HDF5 for Lustre File Systems

It is demonstrated that the combined optimizations improve HDF5 parallel I/O performance by up to 33x in some cases, running close to the achievable peak performance of the underlying file system, with scalable performance up to 40,960-way concurrency.
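One of the tunables typically involved in such work is the Lustre striping layout of the output directory. A configuration sketch, assuming a hypothetical path; the stripe count and size below are illustrative and must be tuned to the target system:

```shell
# Stripe a Lustre directory before writing a large shared HDF5 file,
# spreading the file across many object storage targets (OSTs).
lfs setstripe -c 64 -S 4M /scratch/myrun/output   # 64 OSTs, 4 MiB stripes
lfs getstripe /scratch/myrun/output               # verify the layout
```

Files created in the directory afterwards inherit this layout, which is what lets many writers hit many OSTs in parallel instead of serializing on one.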

Spark: Cluster Computing with Working Sets

Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
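The speedup on iterative jobs comes from caching the working set instead of recomputing it from lineage on every pass. A toy illustration of that distinction (plain Python, not Spark's API):

```python
# Toy model of lazy datasets with and without caching: without a cache,
# a lazily evaluated dataset is recomputed on every iteration; with a
# cache, it is computed once and reused.

class LazyDataset:
    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self.computations = 0   # how many times the data was materialized

    def cache(self):
        self._cache = self._compute()
        self.computations += 1
        return self

    def collect(self):
        if self._cache is not None:
            return self._cache  # served from memory, no recomputation
        self.computations += 1
        return self._compute()

uncached = LazyDataset(lambda: [x * x for x in range(5)])
for _ in range(3):              # three "iterations" of a training loop
    uncached.collect()
print(uncached.computations)    # → 3

cached = LazyDataset(lambda: [x * x for x in range(5)]).cache()
for _ in range(3):
    cached.collect()
print(cached.computations)      # → 1
```

In real Spark the same effect is obtained by calling `persist()`/`cache()` on an RDD before the iterative phase.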

Locality-driven high-level I/O aggregation for processing scientific datasets

The proposed locality-driven high-level I/O aggregation approach holds promise for efficiently processing scientific datasets, which is critical in the data-intensive (big data) computing era.

Data sieving and collective I/O in ROMIO

  • R. Thakur, W. Gropp, E. Lusk
  • Computer Science
    Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation
  • 1999
This work describes how the MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests and explains in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes.
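The data-sieving optimization can be sketched in a few lines (this is an illustrative simplification, not ROMIO's actual implementation): rather than issuing one small read per noncontiguous request, read one contiguous span covering all of them and extract the pieces in memory.

```python
# Illustrative data-sieving sketch: one large contiguous read replaces
# many small noncontiguous reads; requested pieces are cut out of the
# in-memory buffer afterwards.
import io

def sieved_read(f, requests):
    """requests: list of (offset, length) pairs, assumed sorted by offset."""
    lo = requests[0][0]
    hi = max(off + ln for off, ln in requests)
    f.seek(lo)
    buf = f.read(hi - lo)                     # single contiguous read
    return [buf[off - lo : off - lo + ln] for off, ln in requests]

f = io.BytesIO(b"abcdefghijklmnopqrstuvwxyz")  # stand-in for a file
pieces = sieved_read(f, [(2, 3), (10, 2), (20, 4)])
print(pieces)  # → [b'cde', b'kl', b'uvwx']
```

The trade-off is that the "holes" between requests are read and discarded, so sieving pays off when request gaps are small relative to per-read latency.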

Hierarchical Collective I/O Scheduling for High-Performance Computing

SciHadoop: Array-based query processing in Hadoop

  • Joe B. Buck, Noah Watkins, S. Brandt
  • Computer Science
    2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • 2011
This work describes the implementation of a SciHadoop prototype for NetCDF data sets and quantifies the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads.

MLlib: Machine Learning in Apache Spark

MLlib is presented, Spark's open-source distributed machine learning library that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.

Collective caching: application-aware client-side file caching

This paper proposes the idea of "collective caching", which coordinates the application processes to manage cached data and achieve cache coherence without involving the I/O servers. A collective caching subsystem is implemented at user space as a library, which can be incorporated into any message passing interface implementation to increase its portability.

SciSpark: Applying in-memory distributed computing to weather event detection and tracking

SciSpark, a Big Data framework that extends Apache Spark for scaling scientific computations, is presented, and aspects of the Grab 'em Tag 'em Graph 'em algorithm are implemented using SciSpark and its MapReduce capabilities.