High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs

@inproceedings{Nugteren2011HighPP,
  title={High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs},
  author={Cedric Nugteren and Gert-Jan van den Braak and Henk Corporaal and Bart Mesman},
  booktitle={GPGPU-4},
  year={2011}
}
Graphics Processing Units (GPUs) are suitable for highly data-parallel algorithms such as image processing, due to their massively parallel processing power. Many image processing applications use the histogramming algorithm, which fills a set of bins according to the frequency of occurrence of pixel values taken from an input image. Histogramming has been mapped onto the GPU prior to this work. Although significant research effort has been spent in optimizing the mapping, we show that the…
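For readers unfamiliar with the mapping, a minimal CUDA sketch of the basic algorithm is given below: each block accumulates a private sub-histogram in shared memory with atomic bin updates and merges it into the global result. Kernel and parameter names are illustrative and not taken from the paper.

#include <cuda_runtime.h>

#define NUM_BINS 256  // one bin per 8-bit pixel value

// Illustrative kernel: each block builds a private sub-histogram in shared
// memory, then merges it into the global histogram.
__global__ void histogram256(const unsigned char* image, int num_pixels,
                             unsigned int* global_hist)
{
    __shared__ unsigned int local_hist[NUM_BINS];

    // Clear the per-block sub-histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Grid-stride loop: each thread votes for the bin of every pixel it reads.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < num_pixels;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[image[i]], 1u);
    __syncthreads();

    // Merge the sub-histogram into the global result.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}

The data-dependent behaviour the abstract refers to comes from the atomicAdd on local_hist: when many pixels share the same value, the atomic updates serialize, so run time depends on the statistics of the input image. The papers listed below explore different points in this trade-off space.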
Efficient Weighted Histogramming on GPUs with CUDA
TLDR
A new method for histogramming on GPUs is presented that reduces collision intensity by rearranging the input, provides predictable performance across data sets with different statistics, and shows improved performance over state-of-the-art implementations.
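The collision-reduction idea can be illustrated, in spirit rather than as the authors' exact scheme, by grouping equal values before voting (e.g. with a pre-sort) so that each thread issues one atomic per run of equal values instead of one per element. The kernel below is a hedged sketch; the chunking and names are assumptions.

#include <cuda_runtime.h>

// Sketch only: assumes `keys` has already been sorted (e.g. with thrust::sort),
// so equal values are adjacent. Each thread scans one contiguous chunk and
// issues a single atomicAdd per run of equal values, reducing collisions.
__global__ void histogram_sorted_runs(const unsigned char* keys, int n,
                                      unsigned int* hist, int chunk)
{
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    if (start >= n) return;
    int end = min(start + chunk, n);

    unsigned char current = keys[start];
    unsigned int count = 1;
    for (int i = start + 1; i < end; ++i) {
        if (keys[i] == current) {
            ++count;
        } else {
            atomicAdd(&hist[current], count);  // one atomic per run
            current = keys[i];
            count = 1;
        }
    }
    atomicAdd(&hist[current], count);          // flush the last run
}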
An optimized approach to histogram computation on GPU
TLDR
This paper proposes a highly optimized approach to histogram calculation that uses histogram replication for eliminating position conflicts, padding to reduce bank conflicts, and an improved access to input data called interleaved read access.
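The three optimizations named above (replication, padding, interleaved reads) can be combined in a single CUDA sketch; the replication factor, the +1 padding, and the warp-to-copy mapping below are illustrative assumptions rather than the paper's exact configuration.

#include <cuda_runtime.h>

#define NUM_BINS   256
#define NUM_COPIES 8               // assumed replication factor: one copy per warp of a 256-thread block
#define PADDED     (NUM_BINS + 1)  // +1 padding so successive copies start in different banks

__global__ void histogram_replicated(const unsigned char* image, int num_pixels,
                                     unsigned int* global_hist)
{
    // Replication gives each warp a private copy (fewer position conflicts);
    // the padding staggers the copies across shared-memory banks.
    __shared__ unsigned int sub[NUM_COPIES * PADDED];

    for (int i = threadIdx.x; i < NUM_COPIES * PADDED; i += blockDim.x)
        sub[i] = 0;
    __syncthreads();

    unsigned int* my_copy = &sub[(threadIdx.x / 32 % NUM_COPIES) * PADDED];

    // Interleaved read access: consecutive threads read consecutive pixels on
    // every iteration, so the loads coalesce.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < num_pixels;
         i += gridDim.x * blockDim.x)
        atomicAdd(&my_copy[image[i]], 1u);
    __syncthreads();

    // Reduce the copies and merge them into the global histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int sum = 0;
        for (int c = 0; c < NUM_COPIES; ++c)
            sum += sub[c * PADDED + b];
        atomicAdd(&global_hist[b], sum);
    }
}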
Improving GPU Performance: Reducing Memory Conflicts and Latency
TLDR
A set of software techniques to improve the parallel updating of output bins in so-called 'voting algorithms', such as the histogram and the Hough transform, is analyzed, implemented and optimized on GPUs.
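As a concrete second example of a voting algorithm, a minimal and deliberately unoptimized Hough-transform vote kernel is sketched below; the discretization and names are assumptions. Every edge pixel votes into a (theta, rho) accumulator with atomicAdd, which raises exactly the parallel bin-update problem described above.

#include <cuda_runtime.h>

#define NUM_THETA 180  // assumed 1-degree angular resolution

// Sketch only: each edge pixel casts NUM_THETA votes into a global
// (theta, rho) accumulator of size NUM_THETA * (2 * rho_max + 1), where
// rho_max >= sqrt(width^2 + height^2). The atomicAdd on the accumulator is
// the contended update that the software techniques above target.
__global__ void hough_vote(const unsigned char* edges, int width, int height,
                           int rho_max, unsigned int* accumulator)
{
    const float PI = 3.14159265f;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height || edges[y * width + x] == 0) return;

    for (int t = 0; t < NUM_THETA; ++t) {
        float theta = t * PI / NUM_THETA;
        int rho = (int)roundf(x * cosf(theta) + y * sinf(theta)) + rho_max;
        atomicAdd(&accumulator[t * (2 * rho_max + 1) + rho], 1u);
    }
}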
An OpenACC Optimizer for Accelerating Histogram Computation on a GPU
  • Kei Ikeda, Fumihiko Ino, K. Hagihara
  • 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016
TLDR
A source-to-source OpenACC optimizer is presented that automatically optimizes histogram computation code for a graphics processing unit (GPU), detecting histogram blocks and rewriting them so that multiple histogram copies can be exploited for acceleration.
Modestly faster histogram computations on GPUs
TLDR
TRISH is a deterministic algorithm that avoids atomic operations and gives data-independent performance, running up to 50% faster than previous GPU methods on random data and 2-4× faster on image data.
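TRISH itself is not reproduced here, but the general idea of trading atomics for privatization plus a deterministic reduction can be sketched as follows; the bin count, the column layout, and the two-pass structure are assumptions for illustration only.

#include <cuda_runtime.h>

#define NUM_BINS 64  // assumed small bin count so per-thread counters stay cheap

// Pass 1, no atomics: every thread owns one column of shared-memory counters,
// so no two threads ever update the same location. With a block size that is a
// multiple of 32, column-major indexing also keeps the updates bank-conflict
// free. Launch with NUM_BINS * blockDim.x * sizeof(unsigned int) dynamic shared
// bytes (32 KB for a 128-thread block).
__global__ void histogram_private(const unsigned char* keys, int n,
                                  unsigned int* per_block_hist)
{
    extern __shared__ unsigned int counters[];  // NUM_BINS x blockDim.x

    for (int i = threadIdx.x; i < NUM_BINS * blockDim.x; i += blockDim.x)
        counters[i] = 0;
    __syncthreads();

    // Values are folded into NUM_BINS bins purely for illustration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        counters[(keys[i] % NUM_BINS) * blockDim.x + threadIdx.x] += 1;
    __syncthreads();

    // Deterministic reduction: thread b sums bin b across all columns and
    // writes this block's sub-histogram to its own slice of global memory.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        unsigned int sum = 0;
        for (int t = 0; t < blockDim.x; ++t)
            sum += counters[b * blockDim.x + t];
        per_block_hist[blockIdx.x * NUM_BINS + b] = sum;
    }
}
// A second trivial kernel (or host loop) sums the per-block histograms; with no
// atomicAdd anywhere, the result is bit-identical across runs and performance
// no longer depends on how often input values collide.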
Multireduce and Multiscan on Modern GPUs
TLDR
This thesis proposes an algorithm which, despite its generality, performs at least as well as the best published histogramming algorithm for all inputs and provides an 18× speedup over the CPU algorithm for small numbers of buckets, making it the fastest existing histogram algorithm for GPUs.
Simulation and architecture improvements of atomic operations on GPU scratchpad memory
TLDR
This paper proposes to use a hash function in both the bank addressing and the locks of the GPU scratchpad memory, modeled in GPGPU-Sim, reducing thread serialization and yielding a speed-up in histogram and Hough transform applications at minimal hardware cost.
Compiling generalized histograms for GPU
TLDR
It is shown that the histogram implementation taken in isolation outperforms similar primitives from CUB, and that it is competitive with or outperforms the hand-written code of several application benchmarks, even when the latter is specialized for a class of datasets.
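A 'generalized histogram' here is a reduce-by-index in which each bin is combined with an arbitrary associative operator rather than incremented; the CUDA sketch below uses max over floats as the operator, and is an assumption-laden illustration rather than the CUB or Futhark implementation.

#include <cuda_runtime.h>

// Sketch of a generalized histogram (reduce-by-index): bins[bin_of[i]] is
// combined with values[i] under an arbitrary associative operator, here max.
// bins must be initialised to the operator's neutral element (-FLT_MAX for max).
__global__ void reduce_by_index_max(const int* bin_of, const float* values,
                                    int n, float* bins)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float v = values[i];
        float* addr = &bins[bin_of[i]];
        // atomicMax exists only for integer types, so emulate a float max
        // with a compare-and-swap loop on the bit pattern.
        float old = *addr;
        while (v > old) {
            float assumed = old;
            old = __int_as_float(atomicCAS((int*)addr,
                                           __float_as_int(assumed),
                                           __float_as_int(v)));
            if (old == assumed) break;
        }
    }
}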

References

Showing 1-10 of 20 references
Design and Performance Evaluation of Image Processing Algorithms on GPUs
In this paper, we construe key factors in the design and evaluation of image processing algorithms on massively parallel graphics processing units (GPUs) using the compute unified device architecture (CUDA) …
GPU histogram computation
TLDR
This poster presents a method to compute histograms in shader programs and shows that the method enables iterative and histogram-guided algorithms to run efficiently on graphics hardware without costly CPU intervention.
Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices
TLDR
Two efficient histogram algorithms designed for NVIDIA’s compute unified device architecture (CUDA) compatible graphics processing units (GPUs) are presented, showing that the speed of histogram calculations can be improved by up to 30 times compared to a CPU-based implementation.
Efficient histogram generation using scattering on GPUs
TLDR
An efficient scattering-based algorithm computes image histograms entirely on the GPU, allowing histograms with arbitrary numbers of buckets to be created in a single rendering pass and avoiding any communication from the GPU back to the CPU.
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
TLDR
This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, achieving increased performance by reordering accesses to off-chip memory so that requests to the same or contiguous locations are combined, and by applying classical optimizations to reduce the number of executed operations.
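The reordering described above is what CUDA programmers now call memory coalescing; the two kernels below contrast a blocked and an interleaved assignment of elements to threads for the same per-thread workload. This is a generic illustration with assumed names, not code from the paper.

#include <cuda_runtime.h>

// Blocked chunks: consecutive threads read addresses `chunk` elements apart,
// so a warp's 32 loads hit scattered cache lines and cannot be combined.
__global__ void sum_blocked(const float* in, int n, int chunk, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int j = 0; j < chunk; ++j) {
        int i = tid * chunk + j;
        if (i < n) acc += in[i];
    }
    out[tid] = acc;
}

// Interleaved chunks: on every iteration consecutive threads read consecutive
// addresses, so each warp's loads coalesce into a few wide memory transactions.
__global__ void sum_interleaved(const float* in, int n, int chunk, float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int j = 0; j < chunk; ++j) {
        int i = j * stride + tid;
        if (i < n) acc += in[i];
    }
    out[tid] = acc;
}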
Parallel Image Processing Based on CUDA
TLDR
The distinct features of the CUDA GPU are analyzed, the general programming model of CUDA is summarized, and several classical image processing algorithms, such as histogram equalization, cloud removal, edge detection, and DCT encoding/decoding, are implemented with CUDA.
Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries
TLDR
An extension to the CUDA tool-chain is described that provides programmers with a visualization of register life ranges, and guidelines are presented on how to apply optimizations that lower register pressure.
OpenVIDIA: parallel GPU computer vision
TLDR
This paper proposes using GPUs in approximately the reverse way: to assist in "converting pictures into numbers" (i.e., computer vision), and provides a simple API that implements some common computer vision algorithms.
Software engineering for multicore systems: an experience report
TLDR
An experience report with four diverse case studies on multicore software development for general-purpose applications, programmed in different languages and benchmarked on several multicore computers, concludes that tuneable architectural patterns with parallelism at several levels need to be discovered.
Evaluating MapReduce for Multi-core and Multiprocessor Systems
TLDR
It is established that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.