Practical Massively Parallel Sorting

Michael Axtmann, Timo Bingmann, Peter Sanders, Christian Schulz
Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures
Previous parallel sorting algorithms do not scale to the largest available machines, since they either have prohibitive communication volume or prohibitive critical path length. We describe algorithms that are a viable compromise and overcome this gap both in theory and practice. The algorithms are multi-level generalizations of the known algorithms sample sort and multiway mergesort. In particular, our sample sort variant turns out to be very scalable both in theory and practice where it… 
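As a rough illustration of what a multi-level generalization of sample sort can look like, the following Python sketch simulates the idea sequentially. It is not the authors' implementation: the function name `multilevel_sample_sort` and the parameters `k` (buckets per level), `oversample`, and `levels` are assumptions made for this example. Each bucket stands in for a group of processors, and appending to a bucket stands in for the data exchange that a distributed-memory version would perform.

```python
import bisect
import random


def multilevel_sample_sort(data, k=4, oversample=8, levels=2, rng=None):
    """Sequentially simulate a multi-level sample sort (illustrative sketch)."""
    rng = rng or random.Random(0)
    # Base case: no levels left, or too few elements to draw a sample from.
    if levels == 0 or len(data) <= k * oversample:
        return sorted(data)

    # Draw a random sample and take k - 1 evenly spaced splitters from it.
    sample = sorted(rng.sample(data, k * oversample))
    splitters = [sample[i * oversample] for i in range(1, k)]

    # Partition: bucket i receives every x with splitters[i-1] <= x < splitters[i].
    buckets = [[] for _ in range(k)]
    for x in data:
        buckets[bisect.bisect_right(splitters, x)].append(x)

    # Recurse on each bucket with one level fewer, then concatenate in order.
    result = []
    for bucket in buckets:
        result.extend(multilevel_sample_sort(bucket, k, oversample, levels - 1, rng))
    return result
```

With `levels=1` this reduces to ordinary single-level sample sort followed by a local sort of each bucket. Using more levels with a smaller `k` trades extra passes over the data for fewer partners per exchange, which is the kind of compromise between communication volume and critical path length that the abstract refers to.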


Robust Massively Parallel Sorting
This work investigates distributed memory parallel sorting algorithms that scale to the largest available machines and are robust with respect to input size and distribution of the input elements and designs a new variant of quicksort with fast high-quality pivot selection.
Engineering In-place (Shared-memory) Sorting Algorithms
In many of the remaining cases, the new In-place Parallel Super Scalar Radix Sort (IPS$^2$Ra) turns out to be the best algorithm, confirming the claims made about the robust performance of the algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications.
Engineering a Distributed Histogram Sort
This work adopts ideas of the well-known quickselect and sample sort algorithms to minimize data movement and demonstrates that this implementation can keep up with recently proposed distribution sort algorithms in large-scale experiments, without any assumptions on the input keys.
In-place Parallel Super Scalar Samplesort (IPS$^4$o)
We present a sorting algorithm that works in-place, executes in parallel, is cache-efficient, avoids branch-mispredictions, and performs work O(n log n) for arbitrary inputs with high probability.
Communication-Efficient String Sorting
These algorithms inspect only characters that are needed to determine the sorting order and communication volume is reduced by also communicating only those characters and by communicating repetitions of the same prefixes only once.
Theoretically-Efficient and Practical Parallel In-Place Radix Sorting
The performance of Regions Sort is compared to existing parallel in-place and out-of-place sorting algorithms on a variety of input distributions and shown to be faster than optimized out-of-place radix sorting and comparison sorting algorithms.
Massively Parallel 'Schizophrenic' Quicksort
This work presents an MPI-based communication library that supports communicator creation in constant time and without communication, along with the first efficient implementation of Schizophrenic Quicksort, a recursive sorting algorithm for distributed-memory systems that is based on Quicksort.
Fully Flexible Parallel Merge Sort for Multicore Architectures
This work presents a fully flexible sorting method for parallel processing, based on a modified merge sort, that can be implemented for a number of processors, and shows that sorting becomes faster and more efficient with each added processor.
Distributed String Sorting Algorithms
This thesis presents two new distributed string sorting algorithms and introduces a new string generator producing string data sets with the ratio of the distinguishing prefix length to the entire string length being an input parameter.
Efficient Parallel Random Sampling—Vectorized, Cache-Efficient, and Online
A simple divide-and-conquer scheme is proposed that makes sequential algorithms more cache efficient and leads to a parallel algorithm running in expected time O(n/p+log p) on p processors, i.e., scales to massively parallel machines even for moderate values of n.


A comparison of sorting algorithms for the connection machine CM-2
A fast sorting algorithm for the Connection Machine Supercomputer model CM-2 is developed and it is shown that any O(lg n)-depth family of sorting networks can be used to sort n numbers in O(lg n) time in the bounded-degree fixed interconnection network domain.
Direct Bulk-Synchronous Parallel Algorithms
It is shown that optimality to within a multiplicative factor close to one can be achieved for the problems of Gauss-Jordan elimination and sorting, by transportable algorithms that can be applied for a wide range of values of the parameters p, g, and L.
Parallel sorting by over partitioning
Hui Li, K. Sevcik. SPAA '94, 1994.
Implementations on the KSR1 and Hector shared memory multiprocessors show that PSOP achieves nearly linear speedup and outperforms alternative approaches.
Practical Massively Parallel Sorting - Basic Algorithmic Ideas
This work outlines how to combine a number of basic algorithmic techniques that overcome bottlenecks, yielding sorting algorithms that scale to the largest available machines.
Parallel Sorting by Overpartitioning
The approach of parallel sorting by Overpartitioning (PSOP) limits the communication cost by moving each element between the processors at most once, and ensures good load balancing (even
Communication efficient algorithms for fundamental big data problems
This work discusses linear programming in low dimensions, and gives examples for several fundamental algorithmic problems where nontrivial algorithms with sublinear communication volume are possible.
Communication-Efficient Parallel Sorting
The bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for it is shown that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires $\Omega(\log n/\log (h+1))$ communication rounds.
Sorting networks and their applications
To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several
Communication Efficient Algorithms for Top-k Selection Problems
We present scalable parallel algorithms with sublinear per-processor communication volume and low latency for several fundamental problems related to finding the most relevant elements in a set, for