Scalable parallel building blocks for custom data analysis

@inproceedings{Peterka2011ScalablePB,
  title={Scalable parallel building blocks for custom data analysis},
  author={T. Peterka and R. Ross and W. Kendall and A. Gyulassy and Valerio Pascucci and Han-Wei Shen and Teng-Yok Lee and A. Chaudhuri},
  booktitle={2011 IEEE Symposium on Large Data Analysis and Visualization},
  year={2011},
  pages={105--112}
}
We present a set of building blocks that provide scalable data movement capability to computational scientists and visualization researchers for writing their own parallel analysis. The set includes scalable tools for domain decomposition, process assignment, parallel I/O, global reduction, and local neighborhood communication, tasks that are common across many analysis applications. The global reduction is performed with a new algorithm, described in this paper, that efficiently merges blocks of…
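As a rough illustration of the first two building blocks named in the abstract, the sketch below decomposes a 1-D domain into blocks, records each block's immediate neighbors, and assigns blocks to processes round-robin. All names (`Block`, `decompose`, `assign_round_robin`) and the round-robin strategy are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch of regular domain decomposition, neighbor links,
# and block-to-process assignment. Not the paper's actual interface.

from dataclasses import dataclass, field

@dataclass
class Block:
    gid: int                      # global block id
    bounds: tuple                 # (start, stop) extent in a 1-D domain
    neighbors: list = field(default_factory=list)  # gids of adjacent blocks

def decompose(domain_size, num_blocks):
    """Split [0, domain_size) into equal contiguous blocks and link neighbors."""
    step = domain_size // num_blocks
    blocks = [Block(gid=i, bounds=(i * step, (i + 1) * step))
              for i in range(num_blocks)]
    for b in blocks:              # local neighborhood: left/right adjacency
        if b.gid > 0:
            b.neighbors.append(b.gid - 1)
        if b.gid < num_blocks - 1:
            b.neighbors.append(b.gid + 1)
    return blocks

def assign_round_robin(blocks, num_procs):
    """Map block gids to process ranks round-robin (one common strategy)."""
    return {b.gid: b.gid % num_procs for b in blocks}

blocks = decompose(domain_size=100, num_blocks=4)
ranks = assign_round_robin(blocks, num_procs=2)
```

In a real run each rank would then operate only on its assigned blocks and exchange ghost data along the recorded neighbor links.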
Portable data-parallel visualization and analysis in distributed memory environments
TLDR
This paper discusses the extension of Thrust to support concurrency in distributed memory environments across multiple nodes, and describes the details of the distributed implementations of several key data-parallel primitives, including scan, scatter/gather, sort, reduce, and upper/lower bound.
BabelFlow: An Embedded Domain Specific Language for Parallel Analysis and Visualization
TLDR
This work presents an embedded domain specific language (EDSL) to describe algorithms using a new task graph abstraction that demonstrates performance portability at scale, and, in some cases, outperforms hand-optimized implementations.
A Scalable Architecture for Simplifying Full-Range Scientific Data Analysis
TLDR
This dissertation has provided an architectural approach that simplifies and scales data analysis on supercomputing architectures while masking parallel intricacies to the user.
Dataflow coordination of data-parallel tasks via MPI 3.0
TLDR
This work extends the load balancing library ADLB to support parallel tasks, and demonstrates how applications can easily be composed of parallel tasks using Swift dataflow scripts, which are compiled to ADLB programs with performance comparable to hand-coded equivalents.
Versatile Communication Algorithms for Data Analysis
TLDR
Three communication algorithms motivated by data analysis workloads are presented and their performance is benchmarked: merge-based reduction, swap-based reduction, and neighborhood exchange.
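The two reduction patterns named in this summary can be sketched serially, with element-wise sum as the reduce operator. This is a sketch of the communication patterns only, under the assumption of a power-of-two block count, not the paper's implementation.

```python
# Serial sketch contrasting merge-based and swap-based reduction over
# equal-length vectors; element-wise sum is the reduce operator.
# Assumes the number of blocks is a power of two.

def merge_reduce(blocks):
    """Merge-based: each round, pairs combine whole blocks, halving the
    number of participating blocks until one fully reduced block remains."""
    while len(blocks) > 1:
        blocks = [[a + b for a, b in zip(blocks[i], blocks[i + 1])]
                  for i in range(0, len(blocks), 2)]
    return blocks[0]

def swap_reduce(blocks):
    """Swap-based (recursive halving): each round, pairs exchange halves
    of their current piece and reduce the half they keep; at the end,
    block i owns the i-th fully reduced segment."""
    n = len(blocks)
    pieces = list(blocks)
    dist = n // 2
    while dist >= 1:
        nxt = []
        for i, piece in enumerate(pieces):
            partner = pieces[i ^ dist]      # pairwise exchange partner
            half = len(piece) // 2
            if i & dist == 0:               # keep the lower half
                nxt.append([a + b for a, b in zip(piece[:half], partner[:half])])
            else:                           # keep the upper half
                nxt.append([a + b for a, b in zip(piece[half:], partner[half:])])
        pieces = nxt
        dist //= 2
    return [x for p in pieces for x in p]   # gather segments in rank order

vecs = [[1, 2, 3, 4], [10, 20, 30, 40],
        [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
merged = merge_reduce(vecs)
swapped = swap_reduce(vecs)
```

Both variants produce the same element-wise sum; they differ in how much data each participant sends per round, which is the trade-off the paper benchmarks.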
Block-parallel data analysis with DIY2
  • D. Morozov, T. Peterka
  • Computer Science
  • 2016 IEEE 6th Symposium on Large Data Analysis and Visualization (LDAV)
  • 2016
TLDR
The implementation of the main features of the DIY2 programming model and optimizations to improve performance are described and evaluated on complete analysis codes.
A model for optimizing file access patterns using spatio-temporal parallelism
TLDR
This paper introduces a model that can estimate the read time for a file stored in a parallel filesystem when given the file access pattern and employs spatio-temporal parallelism, which combines both spatial and temporal parallelism to provide greater flexibility to possible file access patterns.
Flexible Analysis Software for Emerging Architectures
TLDR
The approach to accelerator programming forms the basis of the Dax toolkit, a framework to build data analysis and visualization algorithms applicable to exascale computing.
VtkSMP: Task-based Parallel Operators for Accelerating VTK Filters
TLDR
This paper studies the parallelization of patterns commonly used in VTK algorithms and proposes a new multi-threaded plugin for VTK that eases the development of parallel multi-core VTK filters and shows that with a limited code refactoring effort one can take advantage of NUMA node capabilities.
ParGAL: A Scalable Grid-Aware Analysis Library for Ultra Large Datasets.
TLDR
A new Parallel Gridded Analysis Library (ParGAL) is described that performs data-parallel versions of several common analysis algorithms on data from a structured or unstructured grid simulation.

References

Showing 1-10 of 39 references
Simplified parallel domain traversal
TLDR
DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale, is designed; it introduces a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads.
Streaming‐Enabled Parallel Dataflow Architecture for Multicore Systems
TLDR
This paper proposes the design of a flexible dataflow architecture aimed at addressing many of the shortcomings of existing systems, including a unified execution model for both demand‐driven and event‐driven models; a resource scheduler that can automatically make decisions on how to allocate computing resources; and support for more general streaming data structures which include unstructured elements.
A configurable algorithm for parallel image-compositing applications
TLDR
A new algorithm called Radix-k is presented that embodies and unifies binary swap and direct-send, two of the best-known compositing methods, enables numerous other configurations through appropriate choices of radices, and shows scalability across image size and system size.
End-to-End Study of Parallel Volume Rendering on the IBM Blue Gene/P
TLDR
To extend the scalability of the direct-send image compositing stage of the volume rendering algorithm, the number of compositing cores is limited when many small messages are exchanged, and the I/O signatures of the algorithm are studied.
Parallel netCDF: A High-Performance Scientific I/O Interface
TLDR
This work presents a new parallel interface for writing and reading netCDF datasets that defines semantics for parallel access and is tailored for high performance, and compares the implementation strategies and performance with HDF5.
MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems
TLDR
These findings indicate that the hybrid-parallel implementation of raycasting volume rendering, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.
A generalized approach for transferring data-types with arbitrary communication libraries
  • M. Michel, J. Devaney
  • Computer Science
  • Proceedings Seventh International Conference on Parallel and Distributed Systems: Workshops
  • 2000
TLDR
This addition to AutoMap/AutoLink extends the functions provided beyond the current send and receive functions available for any data type, to any kind of transfer function, from broadcast to reduce (as long as the process calling reduce is message-aware).
Scalable computation of streamlines on very large datasets
TLDR
This paper reviews two parallelization approaches based on established parallelization paradigms (static decomposition and on-demand loading) and presents a novel hybrid algorithm for computing streamlines aimed at good scalability and performance across the widely varying computational characteristics of streamline-based problems.
Zoltan data management services for parallel dynamic applications
TLDR
The Zoltan library simplifies the load-balancing, data movement, unstructured-communication, and memory usage difficulties that arise in dynamic applications such as adaptive finite-element methods, particle methods, and crash simulations.
Data sieving and collective I/O in ROMIO
  • R. Thakur, W. Gropp, E. Lusk
  • Computer Science
  • Proceedings. Frontiers '99. Seventh Symposium on the Frontiers of Massively Parallel Computation
  • 1999
TLDR
This work describes how the MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests and explains in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes.
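The data-sieving optimization named in this summary can be sketched as follows. This is an illustration of the idea only: ROMIO performs sieving internally inside MPI-IO, not through a user-visible function like the hypothetical `sieved_read` here.

```python
# Sketch of data sieving: instead of issuing many small reads for
# noncontiguous byte ranges, read one contiguous span covering them
# all, then extract the requested pieces in memory.

import io

def sieved_read(f, requests):
    """requests: list of (offset, length). Returns (pieces, num_io_calls)."""
    if not requests:
        return [], 0
    lo = min(off for off, _ in requests)
    hi = max(off + ln for off, ln in requests)
    f.seek(lo)
    buf = f.read(hi - lo)        # one large contiguous read (the "sieve")
    pieces = [buf[off - lo:off - lo + ln] for off, ln in requests]
    return pieces, 1             # versus len(requests) separate I/O calls

data = bytes(range(256))
f = io.BytesIO(data)
pieces, calls = sieved_read(f, [(10, 4), (50, 2), (200, 8)])
```

The trade-off, as the paper discusses, is that the single large read transfers unneeded bytes between the requested ranges, which pays off when I/O call overhead dominates per-byte transfer cost.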