# A super scalar sort algorithm for RISC processors

@inproceedings{Agarwal1996ASS, title={A super scalar sort algorithm for RISC processors}, author={Ramesh C. Agarwal}, booktitle={SIGMOD '96}, year={1996} }

The compare and branch sequences required in a traditional sort algorithm cannot efficiently exploit the multiple execution units present in currently available high-performance RISC processors. This is because of the long latency of the compare instructions and the sequential algorithm used in sorting. With the increased level of integration on a chip, this trend is expected to continue. We have developed new sort algorithms which eliminate almost all the compares, provide functional parallelism…
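The abstract is truncated here and does not spell out the algorithm. As a rough illustration of how a sort can avoid key-to-key compares altogether, the sketch below uses a least-significant-digit radix sort built on counting passes. This is a generic textbook technique, not necessarily the paper's method; the function name `lsd_radix_sort` and the parameters `key_bits` and `radix_bits` are assumptions made for this example.

```python
def lsd_radix_sort(keys, key_bits=32, radix_bits=8):
    """Sort non-negative integers below 2**key_bits without comparing keys.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    buckets = 1 << radix_bits
    mask = buckets - 1
    src = list(keys)
    for shift in range(0, key_bits, radix_bits):
        # Pass 1: histogram of the current digit (no key-to-key compare).
        counts = [0] * buckets
        for k in src:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum turns digit counts into start offsets.
        offsets = [0] * buckets
        total = 0
        for d in range(buckets):
            offsets[d] = total
            total += counts[d]
        # Pass 2: stable scatter of each key to its digit's next free slot.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
    return src
```

Each pass touches every key with only shifts, masks, and indexed loads and stores, so the inner loops contain no data-dependent branch on key order. In compiled code that property is what lets independent iterations overlap across several execution units.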

## 71 Citations

Accessing hardware performance counters in order to measure the influence of cache on the performance of integer sorting

- Computer Science
- Proceedings International Parallel and Distributed Processing Symposium
- 2003

Experiments on an Athlon processor demonstrate that a good balance between L1 data cache misses and retired instructions provides the fastest algorithm for sorting in practical cases. A new flavour of merge-sort is also developed, and it beats its rival.

Fast parallel in-memory 64-bit sorting

- Computer Science
- ICS '01
- 2001

A new algorithm that is more than 2 times faster than the previous fastest 64-bit parallel sorting algorithm, PCS-Radix sort, which adapts to any parallel computer by changing three simple algorithmic parameters.

The effect of local sort on parallel sorting algorithms

- Computer Science
- Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing
- 2002

There are three important contributions in SCS-Radix sort: first, the work saved by detecting data skew dynamically; second, the exploitation of the memory hierarchy done by the algorithm; and third, the execution time stability of SCS-Radix when sorting data sets with different characteristics.

Sequential in-core sorting performance for a SQL data service and for parallel sorting on heterogeneous clusters

- Computer Science
- Future Gener. Comput. Syst.
- 2006

It is shown, through fine experiments on an Athlon processor, that L1 data cache misses are not the central problem, but a subtle proportion of independent retired instructions should be advised to get performance for in-core sorting.

Super Scalar Sample Sort

- Physics, Computer Science
- ESA
- 2004

The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching using predicated instructions, which facilitates optimizations like loop unrolling and software pipelining.
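The decoupling described in this summary can be sketched as follows, assuming the common formulation in which sorted splitters are stored as an implicit binary search tree and each key walks down the tree by index arithmetic instead of an if/else. The names `build_splitter_tree` and `classify` are hypothetical, and Python only shows the data flow; on real hardware the comparison result would feed a predicated or conditional instruction.

```python
def build_splitter_tree(sorted_splitters):
    """Lay out k-1 sorted splitters (k a power of two) as an implicit BST.

    tree[1] is the root; the children of node j are 2*j and 2*j + 1.
    Hypothetical helper for illustration only.
    """
    k = len(sorted_splitters) + 1
    assert k >= 2 and k & (k - 1) == 0, "k must be a power of two"
    tree = [None] * k  # index 0 is unused

    def fill(node, lo, hi):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        tree[node] = sorted_splitters[mid]
        fill(2 * node, lo, mid)
        fill(2 * node + 1, mid + 1, hi)

    fill(1, 0, k - 1)
    return tree


def classify(key, tree):
    """Return the bucket index (0..k-1) of key without an if/else on the key."""
    k = len(tree)
    j = 1
    for _ in range(k.bit_length() - 1):  # exactly log2(k) levels
        # The comparison yields 0 or 1 and is folded into the child index.
        j = 2 * j + (key > tree[j])
    return j - k
```

With splitters `[10, 20, 30]`, `classify` sends keys up to 10 to bucket 0, keys in (10, 20] to bucket 1, keys in (20, 30] to bucket 2, and larger keys to bucket 3. Because the loop has a fixed trip count and no data-dependent branch, it is the kind of code that unrolls and software-pipelines well.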

CC-Radix: a cache conscious sorting based on Radix sort

- Computer Science
- Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2003. Proceedings.
- 2003

CC-Radix improves the data locality by dynamically partitioning the data set into subsets that fit in cache level L2.

Block oriented processing of relational database operations in modern computer architectures

- Computer Science
- Proceedings 17th International Conference on Data Engineering
- 2001

It is argued that a block-oriented processing strategy for database operations can lead to better utilization of the processors and caches, generating significantly higher performance.

High-performance sorting on networks of workstations

- Computer Science
- SIGMOD '97
- 1997

We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale…

Communication conscious radix sort

- Computer Science
- ICS '99
- 1999

A reorganization of Radix sort is proposed that leads to a highly local version of the algorithm at a very low cost and achieves a good load balance which makes it insensitive to skewed data distributions.

Sorting on the SGI Origin 2000: comparing MPI and shared memory implementations

- Computer Science
- Proceedings. SCCC'99 XIX International Conference of the Chilean Computer Science Society
- 1999

This paper analyses the C³-Radix (Communication- and Cache-Conscious Radix) sort algorithm, using the distributed and the shared memory parallel programming models, and explains the reasons for the different behaviours depending on the size of the data sets.

## References

AlphaSort: a RISC machine sort

- Computer Science
- SIGMOD '94
- 1994

A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads and proposes two new benchmarks: Minutesort: how much can you sort in a minute, and DollarSort: how to sort for a dollar.

Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms

- Computer Science
- IBM J. Res. Dev.
- 1994

The paper gives two examples that illustrate how the algorithms and architectural features interplay to produce high-performance codes, which are included in ESSL (Engineering and Scientific Subroutine Library); an overview of ESSL is also given.

Sorting Large Data Files on POOMA

- Computer Science
- CONPAR
- 1990

The results show that the benchmark is able to exploit the full capabilities of the computing power, the storage devices and the communication bandwidth, and they show the applicability of the POOMA platform for this application, even where the POOL implementation was, at the time of the experiment, far from optimal.

Parallel sorting on a shared-nothing architecture using probabilistic splitting

- Computer Science
- [1991] Proceedings of the First International Conference on Parallel and Distributed Information Systems
- 1991

The authors consider the problem of external sorting in a shared-nothing multiprocessor with two techniques for determining ranges of sort keys: exact splitting, using a parallel version of the algorithm proposed by Iyer, Ricard, and Varman; and probabilistic splitting, which uses sampling to estimate quantiles.

Characterization of Alpha AXP performance using TP and SPEC workloads

- Computer Science
- ISCA '94
- 1994

A simple model for evaluating the effects of various design tradeoffs based on the data collected by using hardware monitors is proposed and indicates that Alpha AXP takes advantage of lower cycles per instruction and cycle time to achieve a significant performance advantage.

High-Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2

- Computer Science
- IBM Syst. J.
- 1995

This paper describes the parallel implementation of the five kernel benchmarks from this suite on the IBM SP2™, a scalable, distributed memory parallel computer.

The NAS Parallel Benchmarks

- Computer Science
- Int. J. High Perform. Comput. Appl.
- 1991

A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers that mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.

A measure of transaction processing power

- Computer Science, Business
- 1985

These benchmarks measure the performance of diverse transaction processing systems and a standard system cost measure is stated and used to define price/performance metrics.

Parallel Sorting Methods for Large Data Volumes on a Hypercube Database Computer

- Computer Science
- IWDM
- 1989

Two external sorting algorithms for hypercube database computers are presented based on partitioning of data according to partition values obtained through sampling of the data.

The benchmark handbook for database and transaction processing systems

- Computer Science
- 1991

The Transaction Processing Performance Council (TPC) is a non-profit formed to define transaction processing and database benchmarks and to disseminate … TPC benchmarks are used in evaluating the performance of…