Communication conscious radix sort

@inproceedings{JimnezGonzlez1999CommunicationCR,
  title={Communication conscious radix sort},
  author={Daniel Jim{\'e}nez-Gonz{\'a}lez and Josep-Llu{\'i}s Larriba-Pey and Juan J. Navarro},
  booktitle={Proceedings of the International Conference on Supercomputing (ICS '99)},
  year={1999}
}
The exploitation of data locality in parallel computers is paramount to reducing memory traffic and communication among processing nodes. We focus on the exploitation of locality by Parallel Radix sort. The original Parallel Radix sort has several communication steps in which one sorting key may have to visit several processing nodes. In response to this, we propose a reorganization of Radix sort that leads to a highly local version of the algorithm at a very low cost. As a key feature, our…
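To make the communication pattern concrete, the sketch below shows a minimal sequential LSD radix sort; the function name and parameters are illustrative, not from the paper. In the parallel formulation, each counting pass becomes a redistribution of keys across processing nodes by digit value, which is exactly the per-key communication the paper seeks to reduce.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """Sequential LSD radix sort over non-negative integers.

    One stable bucketing pass per digit; in the parallel version
    each pass redistributes keys across nodes by digit value,
    which is where the communication steps arise.
    """
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, key_bits, bits_per_pass):
        # Scatter keys into one bucket per digit value (stable).
        buckets = [[] for _ in range(1 << bits_per_pass)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        # Gather buckets back in digit order.
        keys = [k for b in buckets for k in b]
    return keys
```

With 8-bit digits and 32-bit keys this makes four passes over the data; the locality-conscious reorganization the paper proposes aims to keep each key's passes on a single node.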


The effect of local sort on parallel sorting algorithms
TLDR
There are three important contributions in SCS-Radix sort: first, the work saved by detecting data skew dynamically; second, the exploitation of the memory hierarchy done by the algorithm; and third, the execution time stability of SCS-Radix when sorting data sets with different characteristics.
Fast parallel in-memory 64-bit sorting
TLDR
A new algorithm that is more than 2 times faster than the previous fastest 64-bit parallel sorting algorithm, PCS-Radix sort, which adapts to any parallel computer by changing three simple algorithmic parameters.
CC-Radix: a cache conscious sorting based on Radix sort
TLDR
CC-Radix improves the data locality by dynamically partitioning the data set into subsets that fit in cache level L2.
Improving Communication Sensitive Parallel Radix Sort for Unbalanced Data
TLDR
An efficient improvement is presented which helps to overcome the problems with unbalanced data characteristics and is tested practically on a Linux-based SMP cluster.
Sorting on the SGI Origin 2000: comparing MPI and shared memory implementations
TLDR
This paper analyses the C³-Radix (Communication- and Cache-Conscious Radix) sort algorithm, using the distributed and the shared memory parallel programming models, and explains the reasons for the different behaviours depending on the size of the data sets.
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
TLDR
This paper describes a new algorithm, based on multiway mergesort, for sorting an array of structures by efficiently exploiting the SIMD instructions and cache memory of today's processors, and shows that this approach exhibited up to 2.1x better single-thread performance than the key-index approach implemented with SIMD instructions when sorting 512M 16-byte records on one core.
Automatic generation of a parallel sorting algorithm
TLDR
Preliminary experimental results show that the automatic generation of a distributed memory parallel sorting routine provides up to a fourfold improvement over standard parallel algorithms with typical parameters.
Designing parallel algorithms for SMP clusters
TLDR
Methods for designing and optimizing parallel algorithms for SMP clusters, which combine two different concepts, show how to adapt the algorithms to the hierarchical environment.
How SIMD width affects energy efficiency: A case study on sorting
  • H. Inoue
  • Computer Science
    2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)
  • 2016
TLDR
The results show that SIMD can reduce power in addition to enhancing the performance, especially when the memory bandwidth is not sufficient to fully drive the cores.
...

References

Showing 1-10 of 43 references
Fast parallel in-memory 64-bit sorting
TLDR
A new algorithm that is more than 2 times faster than the previous fastest 64-bit parallel sorting algorithm, PCS-Radix sort, which adapts to any parallel computer by changing three simple algorithmic parameters.
Load balanced parallel radix sort
TLDR
Experimental results indicate that balanced radix sort can sort 0.5G integers in 20 seconds and 128M doubles in 15 seconds on a 64-processor SP2-WN while yielding over 40-fold speedup.
Sorting on the SGI Origin 2000: comparing MPI and shared memory implementations
TLDR
This paper analyses the C³-Radix (Communication- and Cache-Conscious Radix) sort algorithm, using the distributed and the shared memory parallel programming models, and explains the reasons for the different behaviours depending on the size of the data sets.
Adapting Radix Sort to the Memory Hierarchy
TLDR
The importance of reducing misses in the translation-lookaside buffer (TLB) for obtaining good performance on modern computer architectures is demonstrated and three techniques which simultaneously reduce cache and TLB misses for LSB radix sort are given: reducing working set size, explicit block transfer and pre-sorting.
A Benchmark Parallel Sort for Shared Memory Multiprocessors
The first parallel sort algorithm for shared memory MIMD (multiple-instruction-multiple-data-stream) multiprocessors that has a theoretical and measured speedup near linear is exhibited. It is based…
A super scalar sort algorithm for RISC processors
TLDR
New sort algorithms which eliminate almost all the compares, provide functional parallelism which can be exploited by multiple execution units, significantly reduce the number of passes through keys, and improve data locality are developed.
An analysis of superscalar sorting algorithms on an R8000 processor
TLDR
It is possible to understand that Radix sort is the most promising of the methods studied here for future superscalar architectures and the use of combined methods does not help to exploit locality.
Parallel algorithms for personalized communication and sorting with an experimental study (extended abstract)
TLDR
A novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead; its performance is invariant over the set of input distributions, unlike previous efficient algorithms.
The Block Distributed Memory Model
  • J. JáJá, K. Ryu
  • Computer Science
    IEEE Trans. Parallel Distributed Syst.
  • 1996
TLDR
This work introduces a computation model for developing and analyzing parallel algorithms on distributed memory machines and shows that most of these algorithms achieve optimal or near optimal communication complexity while simultaneously guaranteeing an optimal speed-up in computational complexity.
Design, analysis, and implementation of parallel external sorting algorithms
TLDR
A modified merge-sort is proposed to use as a method for eliminating duplicate records in a large file and a combinatorial model is developed to provide an accurate estimate for the cost of the duplicate elimination operation (both in the serial and the parallel cases).
...