Practical Massively Parallel Sorting

Michael Axtmann, Timo Bingmann, Peter Sanders, Christian Schulz.
In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2015).
Previous parallel sorting algorithms do not scale to the largest available machines, since they either have prohibitive communication volume or prohibitive critical path length. We describe algorithms that are a viable compromise and overcome this gap both in theory and practice. The algorithms are multi-level generalizations of the known algorithms sample sort and multiway mergesort. In particular, our sample sort variant turns out to be very scalable both in theory and practice where it… 
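The abstract describes multi-level generalizations of sample sort. As a rough illustration of the single-level building block being generalized (a sketch only, not the authors' implementation; the function name, oversampling scheme, and parameters are invented here), one level of sample sort picks random samples, derives splitters from them, and scatters elements into buckets; in the multi-level distributed setting each bucket would go to a group of processors and the scheme recurses:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// One level of sample sort: draw numBuckets * oversampling random samples,
// sort them, take every oversampling-th sample as a splitter, then place
// each element into the bucket delimited by the splitters. Oversampling
// makes the bucket sizes concentrate around n / numBuckets.
std::vector<std::vector<int>> sampleSortPartition(const std::vector<int>& data,
                                                  int numBuckets,
                                                  int oversampling) {
    std::vector<int> sample;
    for (int i = 0; i < numBuckets * oversampling; ++i)
        sample.push_back(data[std::rand() % data.size()]);
    std::sort(sample.begin(), sample.end());

    std::vector<int> splitters;  // numBuckets - 1 splitters
    for (int i = 1; i < numBuckets; ++i)
        splitters.push_back(sample[i * oversampling]);

    std::vector<std::vector<int>> buckets(numBuckets);
    for (int x : data) {
        // Index of the first splitter greater than x is the bucket index,
        // so bucket b holds elements in [splitters[b-1], splitters[b]).
        int b = std::upper_bound(splitters.begin(), splitters.end(), x)
                - splitters.begin();
        buckets[b].push_back(x);
    }
    return buckets;
}
```

Because the buckets form ordered key ranges, sorting each bucket independently and concatenating the results yields the fully sorted sequence, which is what makes the recursion (and the distributed data exchange) work.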


Robust Massively Parallel Sorting
This work investigates distributed-memory parallel sorting algorithms that scale to the largest available machines and are robust with respect to input size and distribution of the input elements, and designs a new variant of quicksort with fast, high-quality pivot selection.
Engineering In-place (Shared-memory) Sorting Algorithms
In many of the remaining cases, the new In-place Parallel Super Scalar Radix Sort (IPS$^2$Ra) turns out to be the best algorithm, confirming the claims made about the robust performance of the algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications.
Engineering a Distributed Histogram Sort
This work adopts ideas of the well-known quickselect and sample sort algorithms to minimize data movement and demonstrates that this implementation can keep up with recently proposed distribution sort algorithms in large-scale experiments, without any assumptions on the input keys.
In-place Parallel Super Scalar Samplesort (IPS$^4$o)
We present a sorting algorithm that works in-place, executes in parallel, is cache-efficient, avoids branch-mispredictions, and performs work O(n log n) for arbitrary inputs with high probability.
Communication-Efficient String Sorting
These algorithms inspect only the characters needed to determine the sorting order; communication volume is reduced by communicating only those characters and by communicating repetitions of the same prefixes only once.
Theoretically-Efficient and Practical Parallel In-Place Radix Sorting
The performance of Regions Sort is compared to existing parallel in-place and out-of-place sorting algorithms on a variety of input distributions and shown to be faster than optimized out-of-place radix sorting and comparison sorting algorithms.
Massively Parallel 'Schizophrenic' Quicksort
A communication library based on MPI is presented that supports communicator creation in constant time and without communication, along with the first efficient implementation of Schizophrenic Quicksort, a recursive sorting algorithm for distributed-memory systems based on Quicksort.
Fully Flexible Parallel Merge Sort for Multicore Architectures
A fully flexible sorting method designed for parallel processing, based on a modified merge sort, that can be implemented for any number of processors; results show that sorting becomes faster and more efficient with each newly added processor.
Distributed String Sorting Algorithms
This thesis presents two new distributed string sorting algorithms and introduces a new string generator producing string data sets with the ratio of the distinguishing prefix length to the entire string length being an input parameter.
Parallelization of Modified Merge Sort Algorithm
The parallelized sorting method is claimed to be faster and beneficial for multi-core systems; it is compared to other sorting methods such as quicksort, heap sort, and merge sort to show its potential efficiency.


Highly scalable parallel sorting
  • Edgar Solomonik, L. Kalé
  • Computer Science
    2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
  • 2010
A scalable extension of the Histogram Sorting method is presented, making fundamental modifications to the original algorithm in order to minimize message contention and exploit overlap.
Super Scalar Sample Sort
The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching using predicated instructions, which facilitates optimizations like loop unrolling and software pipelining.
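The decoupling described here can be sketched as follows (a simplified illustration, not the paper's actual code): the splitters are stored as an implicit binary search tree, and the comparison result feeds directly into the index arithmetic, so the compiler can emit a conditional move or predicated instruction instead of an unpredictable branch:

```cpp
// Branch-free classification of x against 2^logBuckets - 1 splitters
// stored as an implicit complete binary search tree: tree[1] is the
// root and the children of node i are 2i and 2i + 1. The loop has a
// fixed trip count and no data-dependent branches, which enables loop
// unrolling and software pipelining as described in the abstract.
int classify(int x, const int* tree, int logBuckets) {
    int i = 1;
    for (int level = 0; level < logBuckets; ++level)
        i = 2 * i + (x > tree[i]);   // comparison result used arithmetically
    return i - (1 << logBuckets);    // bucket index in [0, 2^logBuckets)
}
```

For example, with splitters {10, 20, 30} laid out as tree[1..3] = {20, 10, 30}, elements below 10 land in bucket 0 and elements above 30 in bucket 3.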
A comparison of sorting algorithms for the Connection Machine CM-2
A fast sorting algorithm for the Connection Machine supercomputer model CM-2 is developed, and it is shown that any O(lg n)-depth family of sorting networks can be used to sort n numbers in O(lg n) time in the bounded-degree fixed interconnection network domain.
Direct Bulk-Synchronous Parallel Algorithms
It is shown that optimality to within a multiplicative factor close to one can be achieved for the problems of Gauss-Jordan elimination and sorting, by transportable algorithms that can be applied for a wide range of values of the parameters p, g, and L.
Parallel sorting by over partitioning
  • Hui Li, K. Sevcik
  • Computer Science
    SPAA '94
  • 1994
Implementations on the KSR1 and Hector shared memory multiprocessors show that PSOP achieves nearly linear speedup and outperforms alternative approaches.
Practical Massively Parallel Sorting - Basic Algorithmic Ideas
This work outlines ideas on how to combine a number of basic algorithmic techniques which overcome bottlenecks to obtain sorting algorithms that scale to the largest available machines.
Parallel Sorting by Overpartitioning
The approach of parallel sorting by Overpartitioning (PSOP) limits the communication cost by moving each element between the processors at most once, and ensures good load balancing (even…
Communication efficient algorithms for fundamental big data problems
This work discusses linear programming in low dimensions, and gives examples for several fundamental algorithmic problems where nontrivial algorithms with sublinear communication volume are possible.
Communication-Efficient Parallel Sorting
The bound on the number of communication rounds is asymptotically optimal for the full range of values of p, as it is shown that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires $\Omega(\log n/\log (h+1))$ communication rounds.
Engineering Parallel String Sorting
This work proposes string sample sort, a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, and describes sequential LCP-insertion sort which calculates the LCP array and accelerates its insertions using it.