Scientific computing meets big data technology: An astronomy use case

@article{Zhang2015ScientificCM,
  title={Scientific computing meets big data technology: An astronomy use case},
  author={Zhao Zhang and Kyle Barbary and Frank A. Nothaft and Evan R. Sparks and Oliver Zahn and Michael J. Franklin and David A. Patterson and Saul Perlmutter},
  journal={2015 IEEE International Conference on Big Data (Big Data)},
  year={2015},
  pages={918-927}
}
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark - a modern big data platform - to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache… 
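
One way (purely an illustration, not the paper's implementation) to drive such a many-task dataflow from Spark is to stream each input record through the external single-process program with RDD.pipe; the wrapper script name below is a placeholder.

import org.apache.spark.sql.SparkSession

// Hedged sketch: each image path is piped through an external, unmodified
// single-process tool, while Spark handles scheduling and data movement.
// "./extract_sources.sh" is a hypothetical wrapper script, not from the paper.
object ManyTaskOnSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("many-task-sketch").getOrCreate()
    val sc = spark.sparkContext

    // One record per input image; in practice the list might come from S3 or HDFS.
    val imagePaths = sc.parallelize(Seq("img_000.fits", "img_001.fits", "img_002.fits"))

    // pipe() forks the external program once per partition, feeds records on
    // stdin, and returns its stdout lines as a new RDD.
    val catalogLines = imagePaths.pipe("./extract_sources.sh")

    catalogLines.saveAsTextFile("catalogs/")
    spark.stop()
  }
}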

Citations

Kira: Processing Astronomy Imagery Using Big Data Technology
TLDR
This work implements Kira, a flexible and distributed astronomy image processing toolkit, and its Source Extractor application, and examines the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the Amazon EC2 cloud.
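
A rough sketch of the per-image pattern such a toolkit exposes follows; the extraction routine is a hypothetical stand-in, not Kira's actual API.

import org.apache.spark.sql.SparkSession

// Hedged sketch: extractSources stands in for whatever source-extraction
// library a Kira-like toolkit would call per image; it is not Kira's API.
case class DetectedSource(x: Double, y: Double, flux: Double)

object ImageExtractionSketch {
  // Hypothetical placeholder for a real per-image extraction call.
  def extractSources(path: String, bytes: Array[Byte]): Seq[DetectedSource] = Seq.empty

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("image-extraction-sketch").getOrCreate()
    val sc = spark.sparkContext

    // binaryFiles yields one (path, stream) pair per FITS file.
    val images = sc.binaryFiles("s3a://some-bucket/images/*.fits")

    val sources = images.flatMap { case (path, stream) =>
      extractSources(path, stream.toArray())
    }

    println(s"extracted ${sources.count()} sources")
    spark.stop()
  }
}
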
Analyzing astronomical data with Apache Spark
TLDR
The performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys is investigated, and it is shown how to manage more complex binary data structures, such as those handled in astrophysics experiments, within a distributed environment.
FITS Data Source for Apache Spark
TLDR
The performance of Apache Spark, a cluster computing framework, is investigated for analyzing data from future LSST-like galaxy surveys, along with how to manage complex binary data structures handled in astrophysics experiments, such as binary tables stored in FITS files, within a distributed environment.
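
A minimal sketch of reading FITS binary tables into Spark, assuming the open-source spark-fits connector is on the classpath; the "fits" format name and "hdu" option follow that connector's documentation and should be checked against the deployed version.

import org.apache.spark.sql.SparkSession

// Sketch assuming the spark-fits connector is available; options are passed as
// strings, and HDU 1 is taken to be a binary table in these hypothetical files.
object FitsReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fits-read-sketch").getOrCreate()

    // Read HDU 1 from many FITS files into a single DataFrame.
    val df = spark.read
      .format("fits")
      .option("hdu", "1")
      .load("hdfs:///surveys/catalog_*.fits")

    df.printSchema()
    df.show(5)
    spark.stop()
  }
}
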
Using Thrill to Process Scientific Data on HPC
TLDR
This work explores Thrill, a framework for big data computation on HPC clusters that provides an interface similar to systems like Apache Spark but delivers higher performance, and implements several operations to analyze data from plasma physics and molecular dynamics simulations.
Spark-DIY: A Framework for Interoperable Spark Operations with High Performance Block-Based Data Models
TLDR
This paper proposes an architecture that supports the integration of highly scalable MPI block-based data models and communication patterns with a map-reduce-based programming model; it preserves the data abstraction and programming interface of Spark but allows the user to delegate operations to the MPI layer.
On-demand data analytics in HPC environments at leadership computing facilities: Challenges and experiences
TLDR
This paper proposes an on-demand Spark service that mitigates these difficulties, allowing facility users to create Spark instances quickly, easily, and flexibly; it defines a systematic approach for creating these Spark instances and validates that optimal performance benefits are maintained.
Apache Spark: A Unified Engine for Big Data Processing
TLDR
Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing, using a programming model similar to MapReduce but extended with a data-sharing abstraction called "Resilient Distributed Datasets", or RDDs.
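
The data sharing that RDDs provide can be illustrated with a small sketch (synthetic input, not from the paper): a dataset is parsed once, cached in memory, and reused by several computations.

import org.apache.spark.sql.SparkSession

// Illustration of RDD data sharing: parse once, cache, reuse across analyses.
object RddSharingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-sharing-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input file: one magnitude value per line.
    val magnitudes = sc.textFile("magnitudes.txt").map(_.toDouble).cache()

    // Both jobs read the cached RDD instead of re-parsing the text file.
    val mean = magnitudes.sum() / magnitudes.count()
    val brightest = magnitudes.min()

    println(s"mean=$mean brightest=$brightest")
    spark.stop()
  }
}
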
Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms
TLDR
This work shows in a first phase that such an instantiation of Big Data analysis on an HPC system is both relevant and feasible; in a second phase, it greatly improves performance through efficient configuration of the HPC resources and tuning of the application.
Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet
TLDR
Experiences and benefits of using optimized Remote Direct Memory Access (RDMA) Hadoop and Spark middleware on the XSEDE Comet HPC resource are discussed, including some performance results of Big Data benchmarks and applications.

References

SHOWING 1-10 OF 30 REFERENCES
Rethinking Data-Intensive Science Using Scalable Analytics Systems
TLDR
This work describes ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28x speedup over current genomics pipelines while reducing cost by 63%.
Spark: Cluster Computing with Working Sets
TLDR
Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
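
The iterative pattern behind that result can be sketched as follows; the data is synthetic and the gradient step is a toy, only the cache-and-rescan pattern is the point.

import org.apache.spark.sql.SparkSession
import scala.util.Random

// Hedged sketch of an iterative job over a cached working set: the training
// data is materialized once and every iteration rescans it from memory.
object IterativeSketch {
  case class Point(x: Double, y: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("iterative-sketch").getOrCreate()
    val sc = spark.sparkContext

    val rng = new Random(42)
    val points = sc.parallelize(
      Seq.fill(10000)(Point(rng.nextGaussian(), rng.nextGaussian()))
    ).cache()

    var w = 0.0
    for (_ <- 1 to 10) {
      // Each pass reads the cached working set rather than the original input.
      val gradient = points.map(p => (w * p.x - p.y) * p.x).sum() / points.count()
      w -= 0.1 * gradient
    }
    println(s"fitted slope: $w")
    spark.stop()
  }
}
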
Many-Task Computing and Blue Waters
This report discusses many-task computing (MTC) generically and in the context of the proposed Blue Waters system, which is planned to be the largest NSF-funded supercomputer when it enters production use.
Toward loosely coupled programming on petascale systems
  • I. Raicu, Zhao Zhang, Ben Clifford
  • 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2008
TLDR
This work enables the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications, and allows a new, and potentially far larger, class of applications to leverage petascale systems, such as the IBM Blue Gene/P supercomputer.
MapReduce for Data Intensive Scientific Analyses
TLDR
This paper presents CGL-MapReduce, a streaming-based MapReduce implementation, compares its performance with Hadoop, and reports experience in applying the MapReduce technique to two scientific data analyses: high-energy physics data analysis and K-means clustering.
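
One K-means iteration in that map/reduce style might look like the following sketch (toy data and centroids; Spark stands in as the map/reduce engine):

import org.apache.spark.sql.SparkSession

// Hedged sketch of a single K-means step: the map phase assigns each point to
// its nearest centroid, the reduce phase averages the points per centroid.
object KMeansStepSketch {
  def closest(p: (Double, Double), centroids: Array[(Double, Double)]): Int =
    centroids.zipWithIndex.minBy { case ((cx, cy), _) =>
      val dx = p._1 - cx; val dy = p._2 - cy; dx * dx + dy * dy
    }._2

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kmeans-step-sketch").getOrCreate()
    val sc = spark.sparkContext

    val points = sc.parallelize(Seq((0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9)))
    var centroids = Array((0.0, 0.0), (1.0, 1.0))

    // Map: tag each point with the index of its nearest centroid.
    val assigned = points.map(p => (closest(p, centroids), (p, 1)))

    // Reduce: sum coordinates and counts per centroid, then average.
    centroids = assigned
      .reduceByKey { case (((x1, y1), n1), ((x2, y2), n2)) => ((x1 + x2, y1 + y2), n1 + n2) }
      .mapValues { case ((sx, sy), n) => (sx / n, sy / n) }
      .collect()
      .sortBy(_._1)
      .map(_._2)

    println(centroids.mkString(", "))
    spark.stop()
  }
}
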
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
TLDR
It is shown that excellent absolute performance can be attained (a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster), as well as near-linear scaling of execution time on representative applications as the number of computers used for a job is varied.
FlumeJava: easy, efficient data-parallel pipelines
TLDR
The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines.
CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications
TLDR
The proposed approach uses the MapReduce paradigm to parallelize tools and manage their execution, machine virtualization to encapsulate their execution environments and commonly used data sets into flexibly deployable virtual machines, and network virtualization to connect resources behind firewalls/NATs while preserving the necessary performance and communication environment.
MapReduce: Simplified Data Processing on Large Clusters
TLDR
This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
TLDR
This paper presents Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner; RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.