Partitioned Parallel Radix Sort

  title={Partitioned Parallel Radix Sort},
  author={Shindong Lee and Minsoo Jeon and Dongseung Kim and Andrew Sohn},
  journal={J. Parallel Distributed Comput.},
Load balanced parallel radix sort solved the load imbalance problem present in parallel radix sort. Redistributing the keys in each round of radix, each processor has exactly the same number of keys, thereby reducing the overall sorting time. Load balanced radix sort is currently known the fastest internal sorting method for distributed-memory multiprocessors. However, as the computation time is balanced, the communication time emerges as the bottleneck of the overall sorting performance due to… 

Figures and Tables from this paper

PARADIS: An Efficient Parallel Algorithm for In-place Radix Sort

A novel parallel in-place radix sort algorithm, PARADIS, which addresses both problems: a "speculative permutation" solves the first problem by assigning multiple non-continuous array stripes to each processor, and the resulting shared-nothing scheme achieves full parallelization.

Load-balanced parallel merge sort on distributed memory parallel computers

  • Minsoo JeonDongseung Kim
  • Computer Science
    Proceedings 16th International Parallel and Distributed Processing Symposium
  • 2002
Sort can be speeded up on parallel computers by dividing and computing data individually in parallel. Merge sort can be parallelized, however, the conventional algorithm implemented on distributed

Parallel Merge Sort with Load Balancing

The proposed load-balanced merge sort utilizes all processors throughout the computation and evenly distributes data to all processors in each stage, thereby every processor is forced to work in all phases.

A full parallel radix sorting algorithm for multicore processors

This paper introduces PARL, a parallel left radix sorting algorithm for use on ordinary shared memory multi core machines, that has just one simple statement in its sequential part and is user friendly in that the user calling this PARL algorithm does not have to deal with creating threads themselves; to sort their data, they just create a sorting object and make a call to a thread safe method in that object.

Parallelizing Merge Sort onto Distributed Memory Parallel Computers

Merge sort is useful in sorting a great number of data progressively, especially when they can be partitioned and easily collected to a few processors. Merge sort can be parallelized, however,

Theoretically-Efficient and Practical Parallel In-Place Radix Sorting

The performance of Regions Sort is compared to existing parallel in-place and out-of-place sorting algorithms on a variety of input distributions and shown to be faster than optimized out- of-place radix sorting and comparison sorting algorithms.

High-speed parallel external sorting of data with arbitrary distribution

Two distribution-insensitive scalable parallel external sorting algorithms that use sampling technique and histogram counts to achieve even distribution of keys, which eventually contribute to achieve good performance are developed.

Modified Pure Radix Sort for Large Heterogeneous Data Set

This research paper presents a new modified pure radix sort algorithm for large heterogeneous data set, which tries to optimize all related problems of radixsort through this algorithm.

Communication-efficient bitonic sort on a distributed memory parallel computer

An efficient way of data communication in bitonic sort to minimize the interprocessor communication and comparison time is presented and it is believed the scheme is a good way to shorten the communication time in similar applications.

A Load-Balanced Parallel and Distributed Sorting Algorithm Implemented with PGX.D

Sorting has been one of the most challenging studied problems in different scientific researches. Although many techniques and algorithms have been proposed on the theory of having efficient parallel



Load balanced parallel radix sort

Experimental results indicate that balanced radix sort can sort OSG integers in 20 seconds and 128M doubles in 15 seconds on a 64-processor SPZWN while yielding over 40-fold speedup.

Communication-efficient bitonic sort on a distributed memory parallel computer

An efficient way of data communication in bitonic sort to minimize the interprocessor communication and comparison time is presented and it is believed the scheme is a good way to shorten the communication time in similar applications.

Parallel algorithms for personalized communication and sorting with an experimental study (extended abstract)

A novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead and performance is invariant over the set of input distributions unlike previous efficient algorithms.

Fast Parallel Sorting Under LogP: Experience with the CM-5

The LogP model is shown to be a valuable guide in the development of parallel algorithms and a good predictor of implementation performance; the model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention.

Minimizing Communication in the Bitonic Sort

A recirculating bitonic sorting network is presented, which is composed of one level of N/2 comparators plus an /spl Omega/-network of (log N-1) switch levels, which reduces the cost complexity to O(N log N) compared with the O (N log/sup 2/ N) of the original bitonic sorted network, while preserving the same time complexity.

Parallel sorting and data partitioning by sampling

The analysis is developed for parallel sorting in a local network environment, with distributed data sets in secondary storage devices, and a data partitioning method by sampling is proposed.

Identifying the capability of overlapping computation with communication

  • A. SohnJui-Yuan Ku Y. Yamaguchi
  • Computer Science
    Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique
  • 1996
Experimental results indicate that both multiprocessors would yield up to 30% to 40% overlap of communication time when the message size is approximately 1K integers.

Tight Bounds on the Complexity of Parallel Sorting

  • F. Leighton
  • Computer Science, Mathematics
    IEEE Transactions on Computers
  • 1985
Tight upper and lower bounds are proved on the number of processors, information transfer, wire area, and time needed to sort N numbers in a bounded-degree fixed-connection network.

Sorting networks and their applications

To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several