Pilot-Streaming: A Stream Processing Framework for High-Performance Computing

@article{Luckow2018PilotStreamingAS,
  title={Pilot-Streaming: A Stream Processing Framework for High-Performance Computing},
  author={Andr{\'e} Luckow and George Chantzialexiou and Shantenu Jha},
  journal={2018 IEEE 14th International Conference on e-Science (e-Science)},
  year={2018},
  pages={177-188}
}
An increasing number of scientific applications utilize stream processing to analyze data feeds from scientific instruments, sensors, and simulations. To address the complexity in the development of streaming applications, we present the Streaming Mini-Apps, which support different pluggable algorithms for data generation and processing, e.g., for reconstructing light source images using different techniques. We use the Streaming Mini-Apps to evaluate the Pilot-Streaming framework, demonstrating…
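As a rough illustration of the pluggable design described in the abstract, the sketch below pairs a swappable data generator with a swappable processing algorithm. It is an assumption-laden Python mock-up, not the actual Streaming Mini-Apps or Pilot-Streaming API; all class and function names are made up for illustration.

# Illustrative sketch only: not the Streaming Mini-Apps API. It mimics the
# abstract's idea of pluggable data-generation and data-processing components
# wired into a simple stream; a broker such as Kafka would sit between the
# two stages in a real deployment.
import random
from abc import ABC, abstractmethod
from typing import Iterable, List


class DataGenerator(ABC):
    """Pluggable source, e.g. synthetic detector frames (hypothetical name)."""

    @abstractmethod
    def generate(self, n_messages: int) -> Iterable[List[float]]:
        ...


class Processor(ABC):
    """Pluggable analysis step, e.g. a reconstruction kernel (hypothetical name)."""

    @abstractmethod
    def process(self, message: List[float]) -> float:
        ...


class RandomFrames(DataGenerator):
    """Stand-in for a light-source data generator."""

    def generate(self, n_messages: int) -> Iterable[List[float]]:
        for _ in range(n_messages):
            yield [random.random() for _ in range(16)]


class MeanIntensity(Processor):
    """Stand-in for a reconstruction algorithm; simply averages each frame."""

    def process(self, message: List[float]) -> float:
        return sum(message) / len(message)


def run_pipeline(source: DataGenerator, sink: Processor, n_messages: int = 5) -> None:
    """Drive the generator into the processor and print each result."""
    for frame in source.generate(n_messages):
        print(sink.process(frame))


if __name__ == "__main__":
    run_pipeline(RandomFrames(), MeanIntensity())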

Citations of this paper

Generalizing Streaming Pipeline Design for Big Data
TLDR
This paper envisions and implements a generalized minimal stream processing pipeline and measures its performance, on some data sets, in the form of delays and latencies of data arrival at pivotal checkpoints in the pipeline, using a Docker™ container without much loss in performance.
HFlow: A Dynamic and Elastic Multi-Layered I/O Forwarder
TLDR
HFlow, a new class of data forwarding system, is presented; it leverages a real-time data movement paradigm with data-independent tasks that can be executed anywhere, thus enabling dynamic resource provisioning, and shows a 3x performance increase over state-of-the-art software solutions.
Reliable and Energy-aware Mapping of Streaming Series-parallel Applications onto Hierarchical Platforms
TLDR
A dynamic programming algorithm is derived for the special case of linear chains, which provides an interesting heuristic and a building block for designing heuristics for the general case of streaming applications.
Pilot-Edge: Distributed Resource Management Along the Edge-to-Cloud Continuum
TLDR
Pilot-Edge is a common abstraction for resource management across the edge-to-cloud continuum based on the pilot abstraction, which decouples resource and workload management, and provides a Function-as-a-Service (FaaS) interface for application-level tasks.
Quality Model for Evaluating and Choosing a Stream Processing Framework Architecture
TLDR
This paper proposes an assessment quality model to evaluate and choose stream processing frameworks, and presents a decision tree to help engineers choose a framework according to those quality aspects.
Contributions to High-Performance Big Data Computing
TLDR
The base architecture, including HPC-ABDS (the High-Performance Computing enhanced Apache Big Data Stack), is described, along with an application use case study identifying key features that determine software and algorithm requirements.
Methods and Experiences for Developing Abstractions for Data-intensive, Scientific Applications
  • André Luckow, S. Jha
  • Computer Science
  • 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2020
TLDR
This work addresses the critical problem of distributed resource management on heterogeneous infrastructure over a dynamic range of scales, a challenge that currently limits many scientific applications and shows how DSR provides a well-defined framework for developing abstractions and middleware systems for distributed systems.
Tomographic Reconstruction of Dynamic Features with Streaming Sliding Subsets
TLDR
The system enables runtime system parameters to be adjusted dynamically over the course of an experiment, providing opportunities for balancing the quality and computational demands of tasks, better observation of phenomena, and improving advanced experimental techniques such as autonomous experimental steering.
High-Performance Ptychographic Reconstruction with Federated Facilities
TLDR
This work presents a system that unifies leadership computing and experimental facilities by enabling the automated establishment of data analysis pipelines that extend from edge data acquisition systems at synchrotron beamlines to remote computing facilities; under the covers, the system uses Globus Auth authentication to minimize user interaction.
A Study Review of Common Big Data Architecture for Small-medium Enterprise
TLDR
The survey emphasizes that many big data components could help small-medium enterprises tackle their big data operational issues.
…

References

Showing 1-10 of 55 references
Towards High Performance Processing of Streaming Data in Large Data Centers
TLDR
This research implements efficient, highly scalable communication algorithms and presents a comprehensive study of performance, taking into account the nature of these applications and the characteristics of cloud runtime environments, and reduces communication costs within a node using an efficient shared-memory approach.
Low Latency Stream Processing: Twitter Heron with Infiniband and Omni-Path
TLDR
The authors present their findings on integrating the Twitter Heron distributed stream processing system with two high-performance interconnects: Infiniband and Intel Omni-Path.
Discretized streams: fault-tolerant streaming computation at scale
TLDR
D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes and tolerates stragglers, and they can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes.
Pilot-Data: An abstraction for distributed data
Pilot-Abstraction: A Valid Abstraction for Data-Intensive Applications on HPC, Hadoop and Cloud Infrastructures?
HPC environments have traditionally been designed to meet the compute demands of scientific applications, and data has only been a second-order concern. With science moving toward data-driven…
A Distributed Message Delivery Infrastructure for Connected Vehicle Technology Applications
TLDR
These experiments reveal that measured latencies are less than the U.S. Department of Transportation's recommended latency requirements for CV applications, which proves the efficacy of the system for CV-related data distribution and management tasks.
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
TLDR
One such approach is presented, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.
Hadoop on HPC: Integrating Hadoop and Pilot-Based Dynamic Resource Management
TLDR
This paper proposes extensions to the Pilot-Abstraction that provide a unifying resource management layer, an important step towards the integration and thereby interoperable use of HPC and Hadoop/Spark, and allow applications to efficiently couple HPC stages.
RADICAL-Pilot: Scalable Execution of Heterogeneous and Dynamic Workloads on Supercomputers
TLDR
RADICAL-Pilot is introduced, a scalable and interoperable pilot system that faithfully implements the Pilot abstraction; its task execution component (the RP Agent) is characterized and engineered for optimal resource utilization while maintaining the full generality of the pilot abstraction.
A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
TLDR
This work analyzes the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigm, and proposes a basis, common terminology, and functional factors upon which to analyze the two approaches.
…