# Parallel sorting on a shared-nothing architecture using probabilistic splitting

@article{DeWitt1991ParallelSO, title={Parallel sorting on a shared-nothing architecture using probabilistic splitting}, author={David J. DeWitt and Jeffrey F. Naughton and Donovan A. Schneider}, journal={[1991] Proceedings of the First International Conference on Parallel and Distributed Information Systems}, year={1991}, pages={280-291} }

The authors consider the problem of external sorting in a shared-nothing multiprocessor. A critical step in the algorithms the authors consider is to determine the range of sort keys to be handled by each processor. They consider two techniques for determining these ranges of sort keys: exact splitting, using a parallel version of the algorithm proposed by Iyer, Ricard, and Varman; and probabilistic splitting, which uses sampling to estimate quantiles. They present analytic results showing that…
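The abstract's idea of probabilistic splitting can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the function names (`choose_splitters`, `partition`, `probabilistic_split_sort`), the default sample size, and the use of in-memory lists in place of disks and processors are all hypothetical choices made for illustration, and the input is assumed to be non-empty.

```python
import bisect
import random


def choose_splitters(keys, num_procs, sample_size, rng):
    """Estimate num_procs - 1 splitting keys from a random sample of the input."""
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Use the i/num_procs quantile of the sorted sample as the i-th splitter.
    return [sample[(i * len(sample)) // num_procs] for i in range(1, num_procs)]


def partition(keys, splitters):
    """Route each key to the simulated processor whose key range contains it."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for key in keys:
        buckets[bisect.bisect_right(splitters, key)].append(key)
    return buckets


def probabilistic_split_sort(keys, num_procs, sample_size=100, seed=0):
    """Return one sorted run per simulated processor (hypothetical helper)."""
    rng = random.Random(seed)
    splitters = choose_splitters(keys, num_procs, sample_size, rng)
    # Each simulated processor sorts only the keys in its own range, so
    # concatenating the runs in processor order yields a fully sorted result.
    return [sorted(bucket) for bucket in partition(keys, splitters)]
```

Because the splitters are only estimates of the true quantiles, the runs are generally of unequal size; a larger sample tends to balance them better at the cost of more sampling work.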

## 167 Citations

Parallel Sorting of Large Data Volumes on Distributed Memory Multiprocessors

- Computer Science
- Parallel Computer Architectures
- 1993

This algorithm is suited to large data volumes (external sorting) and does not suffer from processing skew in the presence of data skew; the optimal degree of CPU parallelism is derived when I/O limitations are taken into account.

PPS-a parallel partition sort algorithm for multiprocessor database systems

- Computer Science
- Proceedings 11th International Workshop on Database and Expert Systems Applications
- 2000

Experimental results demonstrate that the new algorithm performs better than existing parallel range partition sorting algorithms in a shared-nothing database environment for a wide degree of skew.

Overlapping Computations, Communications and I/O in parallel Sorting

- Computer Science
- J. Parallel Distributed Comput.
- 1995

A new parallel sorting algorithm is presented that maximizes the overlap between the disk, network, and CPU subsystems of a processing node and is shown to be of similar complexity to known efficient sorting algorithms.

A synthesis of parallel out-of-core sorting programs on heterogeneous clusters

- Computer Science
- CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings.
- 2003

Three techniques of parallel external sorting in the context of heterogeneous clusters are explored and it is shown how they can be deployed for clusters with processor performances related by a multiplicative factor.

Parallel Sorting by Approximate Splitting for Multi-core Processors

- Computer Science
- 2010 Third International Joint Conference on Computational Science and Optimization
- 2010

An improved partition method, Parallel Sorting by Approximate Splitting, is presented; it is based on an extended pivot-selection algorithm that is more flexible and efficient than other algorithms, such as PSRS.

Adaptive data partition for sorting using probability distribution

- Computer Science
- International Conference on Parallel Processing, 2004. ICPP 2004.
- 2004

A new partition method for sorting based on probability distribution is presented, an idea first studied by Janus and Lamagna in the early 1980s on a mainframe computer, along with an efficient implementation on modern, cache-based machines.

The parameterized Round-Robin partitioned algorithm for parallel external sort

- Computer Science
- Proceedings of 9th International Parallel Processing Symposium
- 1995

A new parameterized parallel sort algorithm, called Round-Robin Partitioned (or RRP), is presented for the message-passing (shared-nothing) architecture and is shown to be superior to the other algorithms for almost all configurations.

External Sorting for Databases in Distributed Heterogeneous Systems

- Computer Science
- 1993

This paper describes a new, load-balanced external parallel sorting method which is more robust to data skew and to variable speed of processes, and compares the run time of the new method with an analogous conventional method in the case of data skew and load imbalances.

## References

A Low Communication Sort Algorithm for a Parallel Database Machine

- Computer Science
- VLDB
- 1989

This work proposes a novel algorithm that exhibits complete parallelism during the sort, merge, and return-to-host phases, and decreases the amount of inter-processor communication compared to existing parallel sort algorithms.

Parallel Partition Sort for Database Machines

- Computer Science
- IWDM
- 1987

A new parallel sorting method, called a parallel partition sort, which transfers only a small amount of data and does not place large demands on the CPU is discussed, based on the top-down partitioning of data.

Parallel algorithms for the execution of relational database operations

- Computer Science
- TODS
- 1983

This paper presents and analyzes algorithms for parallel processing of relational database operations in a general multiprocessor framework, and introduces an analysis methodology which incorporates I/O, CPU, and message costs and which can be adjusted to fit different multiprocessor architectures.

Sorting Large Files on a Backend Multiprocessor

- Computer Science
- IEEE Trans. Computers
- 1988

The results show that, using current off-the-shelf technology coupled with a streamlined distributed operating system, three- and five-microprocessor configurations provide a very cost-effective sort of large files.

Percentile Finding Algorithm for Multiple Sorted Runs

- Computer Science
- VLDB
- 1989

An efficient exact method is given which can find any percentile of an arbitrary number of sorted runs and can improve the speedup for parallel sorting on multiple processors; the work targets a shared-memory MIMD parallel architecture.

A comparison of sorting algorithms for the connection machine CM-2

- Computer Science
- SPAA '91
- 1991

A fast sorting algorithm for the Connection Machine Supercomputer model CM-2 is developed, and it is shown that any O(lg n)-depth family of sorting networks can be used to sort n numbers in O(lg n) time in the bounded-degree fixed interconnection network domain.

Sampling Issues in Parallel Database Systems

- Computer Science
- EDBT
- 1992

This paper proves that for query size estimation, stratified random sampling guarantees perfect load balancing without reducing the accuracy of the estimate, and that for a given number of I/O operations, page level sampling always produces a more accurate estimate than tuple level sampling.

Parallel sorting and data partitioning by sampling

- Computer Science
- 1983

The analysis is developed for parallel sorting in a local network environment, with distributed data sets in secondary storage devices, and a data partitioning method by sampling is proposed.

Parallel Sorting Methods for Large Data Volumes on a Hypercube Database Computer

- Computer Science
- IWDM
- 1989

Two external sorting algorithms for hypercube database computers are presented based on partitioning of data according to partition values obtained through sampling of the data.

An Adaptive Method for Unknown Distributions in Distributive Partitioned Sorting

- Computer Science
- IEEE Transactions on Computers
- 1985

An adaptation of DPS, which estimates the cumulative distribution function of the input data from a randomly selected sample, was developed and tested, and runs only 2-4 percent slower than DPS in the uniform case, but outperforms DPS by 12-13 percent on exponentially distributed data for sufficiently large files.