Robust Massively Parallel Sorting

Michael Axtmann and Peter Sanders
We investigate distributed memory parallel sorting algorithms that scale to the largest available machines and are robust with respect to input size and distribution of the input elements. The main outcome is that four sorting algorithms cover the entire range of possible input sizes. For three algorithms we devise new low-overhead mechanisms to make them robust with respect to duplicate keys and skewed input distributions. One of these, designed for medium-sized inputs, is a new variant of…

Parallel Quicksort without Pairwise Element Exchange

It is shown that, with good pivot selection, Quicksort without pairwise element exchange can be significantly faster than standard implementations on moderately large problems. For smaller input sizes, the standard and exchange-free variants can be combined so that the exchange-free variant takes over as subproblems become large enough relative to the number of processors.

Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

This dissertation focuses on two fundamental sorting problems, string sorting and suffix sorting. It proposes both multiway distribution-based string sorting with string sample sort and multiway merge-based string sorting with LCP-aware merge and mergesort, and engineers and parallelizes both approaches.

Parallel Quicksort without Pairwise Element Exchange

A template implementation is given that reduces the total volume of data exchanged from $O(n\log p)$ to $O(n)$, $n$ being the total number of elements to be sorted and $p$ a power-of-two number of processors, while preserving the flavor, characteristics and properties of a Quicksort implementation.

Communication-Efficient String Sorting

These algorithms inspect only the characters needed to determine the sorting order. Communication volume is reduced by communicating only those characters and by communicating repetitions of the same prefix only once.

OLAPS: Online load-balancing in range-partitioned main memory database with approximate partition statistics

This paper proposes an approach for maintaining balanced loads over a set of nodes as in a system of communicating vessels, by migrating tuples between neighboring nodes, based on an approximate Partition Statistics Table.

Engineering faster sorters for small sets of items

The results clearly show the potential of conditional moves in sorting algorithms: when sorting only small sets of integers, the sorting networks outperform insertion sort.

Massively Parallel 'Schizophrenic' Quicksort

A communication library based on MPI is presented that supports communicator creation in constant time and without communication, together with the first efficient implementation of Schizophrenic Quicksort, a recursive sorting algorithm for distributed memory systems that is based on Quicksort.

Connected Components on a PRAM in Log Diameter Time

This work presents an $O(\log d + \log\log_{m/n} n)$-time randomized PRAM algorithm for computing the connected components of an $n$-vertex, $m$-edge undirected graph with maximum component diameter $d$ and suggests that additional power might not be necessary for fundamental graph problems like connected components and spanning forest.

Decentralized Online Scheduling of Malleable NP-hard Jobs

This work addresses an online job scheduling problem in a large distributed computing environment, using the NP-complete problem of propositional satisfiability (SAT) as a case study, and shows that its approach leads to near-optimal utilization, imposes minimal computational overhead, and performs fair scheduling of incoming jobs within a few milliseconds.



A Randomized Parallel Sorting Algorithm with an Experimental Study

A novel variation on sample sort which uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead, and its performance is invariant over the set of input distributions unlike previous efficient algorithms.

Practical Massively Parallel Sorting

The algorithms are multi-level generalizations of the known algorithms sample sort and multiway mergesort, which turns out to be very scalable both in theory and practice where it scales up to $2^{15}$ MPI processes with outstanding performance in particular for medium-sized inputs.

Highly scalable parallel sorting

  • Edgar Solomonik, L. Kalé
  • Computer Science
    2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
  • 2010
A scalable extension of the Histogram Sorting method is presented, making fundamental modifications to the original algorithm in order to minimize message contention and exploit overlap.

On the Efficient Implementation of Massively Parallel Quicksort

A high-performance variant of parallel Quicksort which incorporates the following optimizations: Stop the recursion at the right time, sort locally first, use accurate yet efficient pivot selection strategies, streamline communication patterns, use locality preserving processor indexing schemes and work with multiple pivots at once.

Efficient Massively Parallel Quicksort

This work has implemented a high performance variant of parallel quicksort which incorporates the following optimizations: Stop the recursion at the right time, sort locally first, use accurate yet efficient pivot selection strategies, streamline communication patterns, use locality preserving processor indexing schemes and work with multiple pivots at once.

HykSort: a new variant of hypercube quicksort on distributed memory architectures

HykSort is an optimized comparison sort for distributed memory architectures that attains more than 2× improvement over bitonic sort and samplesort. The paper also presents a staged communication samplesort, which is more robust than the original samplesort for large core counts.

Parallel Quicksort in hypercubes

A new parallel algorithm, named Cubequicksort, is modified from Hyperquicksort. It has better performance than the other three algorithms and makes better estimates of median keys to ensure a more balanced key distribution among the processor nodes.

A comparison of sorting algorithms for the connection machine CM-2

A fast sorting algorithm for the Connection Machine Supercomputer model CM-2 is developed and it is shown that any $O(\lg n)$-depth family of sorting networks can be used to sort $n$ numbers in $O(\lg n)$ time in the bounded-degree fixed interconnection network domain.

Super Scalar Sample Sort

The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching using predicated instructions, which facilitates optimizations like loop unrolling and software pipelining.

Resource Oblivious Sorting on Multicores

A deterministic sorting algorithm, Sample, Partition, and Merge Sort (SPMS), that interleaves the partitioning of a sample sort with merging and sorts $n$ elements in $O(n\log n)$ time cache-obliviously with an optimal number of cache misses.