Scalable distributed-memory external sorting

  • Mirko Rahn, Peter Sanders, John Victor Singler
  • Published 14 October 2009
  • Computer Science
  • 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)
We engineer algorithms for sorting huge data sets on massively parallel machines. The algorithms are based on the multiway merging paradigm. We first outline an algorithm whose I/O requirement is close to a lower bound. Thus, in contrast to naive implementations of multiway merging and all other approaches known to us, the algorithm works with just two passes over the data even for the largest conceivable inputs. A second algorithm reduces communication overhead and uses more conventional… 
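As a rough illustration of the multiway merging paradigm the abstract refers to, the following is a minimal single-machine sketch, not the paper's distributed algorithm. It assumes the input fits in a Python list and uses a hypothetical `run_size` parameter to stand in for the available main memory. The function names `form_runs`, `multiway_merge`, and `two_pass_sort` are made up for this example. The first pass forms sorted runs, and the second pass merges all runs at once, so each element is handled in exactly two passes.

```python
import heapq
from itertools import islice


def form_runs(items, run_size):
    """Pass 1: cut the input into chunks of at most run_size elements
    and sort each chunk independently (run formation)."""
    it = iter(items)
    runs = []
    while True:
        chunk = list(islice(it, run_size))
        if not chunk:
            break
        chunk.sort()
        runs.append(chunk)
    return runs


def multiway_merge(runs):
    """Pass 2: merge all sorted runs in one step (k-way merge).
    heapq.merge keeps one candidate per run in a min-heap."""
    return list(heapq.merge(*runs))


def two_pass_sort(items, run_size):
    """Sort by run formation followed by a single multiway merge."""
    return multiway_merge(form_runs(items, run_size))
```

In a real external sort the runs would live on disk and be read block by block; the in-memory lists here only stand in for that storage.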

Communication-Efficient String Sorting
These algorithms inspect only the characters that are needed to determine the sorting order. Communication volume is reduced by communicating only those characters and by communicating repetitions of the same prefixes only once.
Engineering Algorithms for Large Data Sets
This paper outlines the general challenges of algorithm engineering and gives examples from my work like sorting, full text indexing, graph algorithms, and database engines.
Algorithm Engineering for Scalable Parallel External Sorting
  • P. Sanders
  • Computer Science
    2011 IEEE International Parallel & Distributed Processing Symposium
  • 2011
The talk describes algorithm engineering (AE) as a methodology for algorithmic research where design, analysis, implementation and experimental evaluation of algorithms form a feedback cycle driving
Algorithm libraries for multi-core processors
By providing parallelized versions of established algorithm libraries, the Multi-Core STL provides basic algorithms for internal memory and the parallelized STXXL enables multi-core acceleration for algorithms on large data sets stored on disk.
TritonSort: A Balanced Large-Scale Sorting System
We present TritonSort, a highly efficient, scalable sorting system. It is designed to process large datasets, and has been evaluated against as much as 100 TB of input data spread across 832 disks in
Energy-efficient sorting using solid state disks
Using a low-power processor, solid state disks, and efficient algorithms, this work beats the current records in the JouleSort benchmark for 10 GB to 1 TB of data by factors of up to 5.1.
TritonSort: A Balanced and Energy-Efficient Large-Scale Sorting System
This article describes the hardware and software architecture necessary to operate TritonSort, a highly efficient, scalable sorting system that is designed to process large datasets and can sort data at approximately 80% of the disks’ aggregate sequential write speed.
Cache efficient functional algorithms
A cost model for analyzing the memory efficiency of algorithms expressed in a simple functional language is presented and provable bounds imply that purely functional programs based on lists and trees with no special attention to any details of memory layout can be asymptotically as efficient as the carefully designed imperative I/O efficient algorithms.
Parallel Data Sort Using Networked FPGAs
This paper shows an example of a data sorting application that uses parallel servers to pre-sort data and then uses FPGAs within the switch to merge sort data as it passes through the network thereby reducing computation requirements at the client node.
Out-of-core distribution sort in the FG programming environment
  • P. Natarajan, T. Cormen, E. Strange
  • Computer Science
    2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
  • 2010
Experimental results show that by using multiple pipelines, an out-of-core, distribution-based sorting program outperforms an out-of-core sorting program based on columnsort approximately 75%–85% of the time, despite the advantages that the columnsort-based program holds.


DEMSort — Distributed External Memory Sort
We present the results of our DEMSort program in various categories of the SortBenchmark. DEMSort is a sophisticated and highly tuned implementation of a mergesort-based algorithm. It makes use of
Asynchronous parallel disk sorting
We develop an algorithm for parallel disk sorting, whose I/O cost approaches the lower bound and that guarantees almost perfect overlap between I/O and computation. Previous algorithms have either
Optimal parallel sorting in multi-level storage
It is found that Sharesort achieves optimal time bounds for parallel sorting in multi-level storage, under a variety of models that have been defined in the literature.
High-performance sorting on networks of workstations
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale
Deterministic distribution sort in shared and distributed memory multiprocessors
An elegant deterministic load balancing strategy for distribution sort is presented that is applicable to a wide variety of parallel disks and parallel memory hierarchies with both single and parallel processors, and it is shown how to sort deterministically in parallel memory hierarchies.
Bulk Synchronous Parallel Algorithms for the External Memory Model
A simple, deterministic simulation technique is presented which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms that meet well-known I/O complexity lower bounds for various problems, including sorting.
Slabpose Columnsort: A New Oblivious Algorithm for Out-of-Core Sorting on Distributed-Memory Clusters
Slabpose columnsort is presented, a new oblivious algorithm that is the first out-of-core multiprocessor sorting algorithm that makes no assumptions about the keys and produces output that is perfectly load balanced and in the striped order assumed by the Parallel Disk Model.
Merging Multiple Lists on Hierarchical-Memory Multiprocessors
Performance and scalability of parallel database systems
This work proposes an architecture which extends the features of the shared-nothing architecture, widely adopted for current parallel database applications, and proposes a new characterization of data skew which captures distinct types of imbalance and presents two data partitioning strategies to deal with this problem in a parallel system.
Algorithms for parallel memory, II: Hierarchical multilevel memories
The optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer.