Corpus ID: 12083012

Efficient Parallel Scan Algorithms for GPUs

@inproceedings{Sengupta2011EfficientPS,
  title={Efficient Parallel Scan Algorithms for GPUs},
  author={Shubhabrata Sengupta and Mark J. Harris and Michael Garland},
  year={2011}
}
Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. [...] Key Method: Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the…
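The abstract's key idea, building larger scans out of intra-warp scan routines, can be illustrated with a short CUDA sketch. The paper's original routines are built on shared memory; the version below uses warp shuffle intrinsics (compute capability 3.0+) as a modern stand-in for the same intra-warp primitive, and the helper names are illustrative rather than taken from the paper.

```cuda
// Sketch: an intra-warp inclusive scan, plus a block-level scan composed
// from it in the divide-and-conquer style the abstract describes.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__device__ int warp_inclusive_scan(int x) {
    const int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int y = __shfl_up_sync(0xffffffff, x, offset);
        if (lane >= offset) x += y;            // add value from `offset` lanes below
    }
    return x;
}

__device__ int block_inclusive_scan(int x) {
    __shared__ int warp_sums[32];              // one total per warp
    const int lane = threadIdx.x & 31;
    const int warp = threadIdx.x >> 5;

    x = warp_inclusive_scan(x);                // step 1: scan within each warp
    if (lane == 31) warp_sums[warp] = x;       // step 2: record each warp's total
    __syncthreads();

    if (warp == 0) {                           // step 3: scan the warp totals
        int s = (lane < blockDim.x / 32) ? warp_sums[lane] : 0;
        warp_sums[lane] = warp_inclusive_scan(s);
    }
    __syncthreads();

    if (warp > 0) x += warp_sums[warp - 1];    // step 4: add preceding warps' totals
    return x;
}
```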
Efficient Scan Operator Methods on a GPU
TLDR: A number of parallel scan methods based on the traditional cyclic reduction tridiagonal solver and the Ladner-Fischer parallel prefix adder are presented, and a set of new features introduced in NVIDIA's Kepler architecture, such as the read-only data cache and shuffle instructions, is analyzed.
Automatic Scan Parallelization in OpenMP
TLDR: Experiments running a set of typical scan-based algorithms on NVIDIA, Intel, and ARM GPUs reveal that the performance of the proposed OpenMP clause is equivalent to that achieved with OpenCL library calls, with the advantage of lower programming complexity.
StreamScan: fast scan algorithms for GPUs without global barrier synchronization
TLDR: StreamScan is a novel approach to implementing scan on GPUs with only one computation phase; the main idea is to restrict synchronization to adjacent workgroups, thereby eliminating global barrier synchronization completely.
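As a rough illustration of how synchronization can be restricted to adjacent workgroups, the hypothetical CUDA fragment below has each block spin on a flag published by its predecessor before adding the running prefix. The array names and helper are illustrative, not from the StreamScan paper, and a real implementation must also guarantee that predecessor blocks make progress (for example by assigning block indices dynamically) to avoid deadlock.

```cuda
// Sketch: chained inter-block prefix propagation. Block b waits for block
// b-1's inclusive prefix, adds its own block total, and publishes the result
// for block b+1. `block_prefix` and `ready` live in global memory and are
// zero-initialized before the kernel launch.
__device__ int propagate_prefix(int block_total, int block_id,
                                volatile int *block_prefix,
                                volatile int *ready) {
    __shared__ int carry;
    if (threadIdx.x == 0) {
        int prev = 0;
        if (block_id > 0) {
            while (ready[block_id - 1] == 0) { }    // spin on predecessor's flag
            prev = block_prefix[block_id - 1];
        }
        block_prefix[block_id] = prev + block_total;
        __threadfence();                            // publish the prefix before the flag
        ready[block_id] = 1;
        carry = prev;
    }
    __syncthreads();
    return carry;                                   // exclusive prefix of this block
}
```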
Parallel Scan for Stream Architectures
TLDR: This work presents three implementations of parallel scan that are memory-bound, utilize 100% of achievable memory bandwidth, and require only a constant amount of global device memory for storing intermediate results.
Optimizing Parallel Prefix Operations for the Fermi Architecture
TLDR: This chapter demonstrates how new instructions that operate on per-thread predicates, together with instructions for manipulating and querying bits within a word, can be used to construct efficient parallel primitives such as reductions, scans, and segmented scans of binary or Boolean data.
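For a concrete flavor of these per-thread-predicate and bit-manipulation instructions, here is a minimal CUDA sketch (using the current __ballot_sync/__popc intrinsics, which may differ from the exact Fermi-era instructions the chapter targets) that computes each lane's exclusive prefix count of a Boolean predicate within a warp.

```cuda
// Sketch: exclusive prefix count of a per-thread predicate within one warp.
// Each lane learns how many lanes below it have pred == true.
__device__ int warp_exclusive_predicate_count(bool pred) {
    unsigned ballot = __ballot_sync(0xffffffff, pred);  // one bit per lane
    unsigned lane   = threadIdx.x & 31;
    unsigned below  = ballot & ((1u << lane) - 1u);     // bits strictly below this lane
    return __popc(below);
}
```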
Efficient Data-Parallel Primitives on Heterogeneous Systems
TLDR: The goal is to identify the key performance factors in implementations of data-parallel primitive operations on different architectures, develop general strategies for implementing these primitives efficiently on various platforms, and deliver common primitive optimization strategies for heterogeneous systems.
Single-pass Parallel Prefix Scan with Decoupled Lookback
We describe a work-efficient, communication-avoiding, single-pass method for the parallel computation of prefix scan. When consuming input from memory, our algorithm requires only ~2n data movement: n…
Efficient Solving of Scan Primitive on Multi-GPU Systems
TLDR: This article presents a GPU-suitable algorithm and a tuning strategy for performing the scan primitive over large problem sizes in CUDA, and describes an optimal proposal analyzed over different multi-GPU environments, which is, to the best of the authors' knowledge, the first multi-GPU batch scan proposal.
A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment
  • Sang-Won Ha, T. Han
  • Computer Science
  • IEEE Transactions on Parallel and Distributed Systems
  • 2013
TLDR: This study presents a parallel scan method that is derived from the Han-Carlson parallel prefix graph and is both work-efficient and depth-optimal, and that employs a novel cascaded thread-block execution method to exploit the single-program-multiple-data (SPMD) nature of the compute unified device architecture (CUDA) environment developed by NVIDIA.
A Novel Parallel Scan for Multicore Processors and Its Application in Sparse Matrix-Vector Multiplication
  • N. Zhang
  • Computer Science
  • IEEE Transactions on Parallel and Distributed Systems
  • 2012
TLDR: A novel approach to sparse matrix-vector multiplication (SpMV) based on the proposed scan is presented; unlike existing approaches that make use of backward segmented operations, it uses forward ones for more efficient caching.

References

Showing 1-10 of 22 references
Fast scan algorithms on graphics processors
TLDR: This work uses novel data representations that map well to the GPU architecture and exploits shared memory to improve memory performance; it improves the performance of scan and segmented scan algorithms by eliminating shared-memory bank conflicts and reducing overheads.
Scan primitives for GPU computing
TLDR: Using the scan primitives, this work shows novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyzes the performance of the scan primitives, several sort algorithms that use them, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
Scans as Primitive Parallel Operations
A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic…
Scan primitives for vector computers
TLDR: The authors describe an optimized implementation of a set of scan primitives on a single processor of a CRAY Y-MP, and demonstrate that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology.
Vector Models for Data-Parallel Computing
TLDR: A model of parallelism that extends and formalizes the Data-Parallel model on which the Connection Machine and other supercomputers are based is described, and it is argued that data-parallel models are not only practical and applicable to a surprisingly wide variety of problems, but are also well suited for very-high-level languages and lead to a concise and clear description of algorithms and their complexity.
Implementation of a Portable Nested Data-Parallel Language
TLDR: Initial benchmark results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.
A Work-Efficient Step-Efficient Prefix Sum Algorithm
TLDR: The prefix-sum algorithm is one of the most important building blocks for data-parallel computation; its applications include parallel implementations of deleting marked elements from an array, radix sort, solving recurrence equations, solving tridiagonal linear systems, and quicksort.
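One of the applications listed above, deleting marked elements from an array (stream compaction), reduces directly to an exclusive scan over keep-flags. Below is a minimal sketch using Thrust's scan and scatter primitives; the function and variable names are illustrative, not from the paper.

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>

// Keep the elements of `in` whose flag in `keep` is 1, packing them densely.
// The exclusive scan of the flags gives each kept element its output index.
thrust::device_vector<int> compact(const thrust::device_vector<int> &in,
                                   const thrust::device_vector<int> &keep) {
    thrust::device_vector<int> pos(in.size());
    thrust::exclusive_scan(keep.begin(), keep.end(), pos.begin());

    // Total number of kept elements = last output index + last flag.
    int count = 0;
    if (!in.empty())
        count = static_cast<int>(pos.back()) + static_cast<int>(keep.back());
    thrust::device_vector<int> out(count);

    // Scatter element i to pos[i] only where keep[i] != 0.
    thrust::scatter_if(in.begin(), in.end(), pos.begin(),
                       keep.begin(), out.begin());
    return out;
}
```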
On Parallel Prefix Computation
We prove that prefix sums of n integers of at most b bits can be found on a COMMON CRCW PRAM in time with a linear time-processor product. The algorithm is optimally fast, for any polynomial number…
Fast Summed‐Area Table Generation and its Applications
TLDR: An algorithm for generating summed-area tables is introduced that is similar to a technique used in scientific computing called recursive doubling and that allows the generation of a summed-area table in O(log n) time.
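Recursive doubling is the same pattern as the Hillis-Steele scan: after step d, each element holds the sum of the 2^d elements ending at its position. A minimal single-block CUDA sketch of that pattern follows (illustrative only; it is step-efficient but not work-efficient, and handles only n <= blockDim.x):

```cuda
// Recursive-doubling (Hillis-Steele) inclusive scan for a single block.
// Launch with one block of >= n threads and n * sizeof(float) shared memory.
__global__ void doubling_scan(float *data, int n) {
    extern __shared__ float buf[];
    int i = threadIdx.x;
    if (i < n) buf[i] = data[i];
    __syncthreads();

    for (int offset = 1; offset < n; offset <<= 1) {
        float addend = 0.0f;
        if (i >= offset && i < n) addend = buf[i - offset];  // read before anyone overwrites
        __syncthreads();
        if (i < n) buf[i] += addend;
        __syncthreads();
    }
    if (i < n) data[i] = buf[i];
}
```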
A programming language
  • K. Iverson
  • Computer Science
  • AIEE-IRE '62 (Spring)
  • 1962
TLDR: The paper describes a succinct problem-oriented programming language that relies heavily on a systematic extension of a small set of basic operations to vectors, matrices, and trees, and on a family of flexible selection operations controlled by logical vectors.