• Published 2016

H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems

@inproceedings{Liu2016H5S,
  title={H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems},
  author={Jialin Liu and Evan Racah and Quincey Koziol and Richard Shane Canon and Alex Gittens and Lisa Gerhardt and Suren Byna and Mike Ringenburg and Prabhat},
  year={2016}
}
The Spark framework has been tremendously powerful for performing Big Data analytics in distributed data centers. However, using Spark to analyze large-scale scientific data on HPC systems has several challenges. For instance, parallel file systems are shared among all computing nodes, in contrast to shared-nothing architectures. Additionally, accessing data stored in commonly used scientific data formats, such as HDF5 and netCDF, is not natively supported in Spark. Our study focuses on…

Citations

Publications citing this paper.

Matrix factorizations at scale: A comparison of scientific data analytics in Spark and C+MPI using three case studies

  • 2016 IEEE International Conference on Big Data (Big Data)
  • 2016
CITES METHODS

SciAP: A Programmable, High-Performance Platform for Large-Scale Scientific Data

  • 2018 International Conference on Cloud Computing, Big Data and Blockchain (ICCBB)
  • 2018
CITES BACKGROUND

SciDP: Support HPC and Big Data Applications via Integrated Scientific Data Processing

  • 2018 IEEE International Conference on Cluster Computing (CLUSTER)
  • 2018
CITES BACKGROUND

ArrayUDF: User-Defined Scientific Data Analysis on Arrays

  • HPDC
  • 2017
CITES BACKGROUND & METHODS

Dynamic Management of In-Memory Storage for Efficiently Integrating Compute- and Data-Intensive Computing on HPC Systems

  • 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
  • 2017
CITES METHODS

References

Publications referenced by this paper.

MLlib: Machine Learning in Apache Spark

  • J. Mach. Learn. Res.
  • 2015
HIGHLY INFLUENTIAL

NERSC 2014 workload analysis, http://portal.nersc.gov/project/mpccc/baustin/nersc 2014 workload analysis v1.1.pdf

Brian Austin
  • 2014

SciHadoop: Array-based query processing in Hadoop

  • 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • 2011