A super scalar sort algorithm for RISC processors

@inproceedings{Agarwal1996ASS,
  title={A super scalar sort algorithm for RISC processors},
  author={Ramesh C. Agarwal},
  booktitle={SIGMOD '96},
  year={1996}
}
  • R. Agarwal
  • Published in SIGMOD '96 1 June 1996
  • Computer Science
The compare and branch sequences required in a traditional sort algorithm cannot efficiently exploit the multiple execution units present in currently available high-performance RISC processors. This is because of the long latency of the compare instructions and the sequential algorithm used in sorting. With the increased level of integration on a chip, this trend is expected to continue. We have developed new sort algorithms which eliminate almost all the compares, provide functional parallelism…
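The abstract is cut off before it names the technique that removes the compares. Purely as an illustration of what a sort with no key-to-key comparisons can look like, here is a minimal sketch of a least-significant-digit radix sort. The function name `lsd_radix_sort` and the parameters `key_bits` and `digit_bits` are assumptions made for this example, and the sketch is not presented as the paper's algorithm.

```python
def lsd_radix_sort(keys, key_bits=32, digit_bits=8):
    """Sort non-negative integers below 2**key_bits without comparing keys.

    Hypothetical helper for illustration; not taken from the paper.
    """
    radix = 1 << digit_bits
    mask = radix - 1
    for shift in range(0, key_bits, digit_bits):
        # Histogram: count how many keys fall into each digit bucket.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum: first output slot of each bucket.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # Stable scatter into the output array, indexed by digit only.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

Each pass uses only shifts, masks, and array indexing on the keys, so the inner loops contain no data-dependent compare-and-branch on key values.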
Accessing hardware performance counters in order to measure the influence of cache on the performance of integer sorting
TLDR
Experiments on an Athlon processor demonstrate that a good balance between L1 data cache misses and retired instructions yields the fastest sorting algorithm in practical cases; a new flavour of merge-sort is developed that beats its rival.
Fast parallel in-memory 64-bit sorting
TLDR
A new algorithm that is more than 2 times faster than the previous fastest 64-bit parallel sorting algorithm, PCS-Radix sort, which adapts to any parallel computer by changing three simple algorithmic parameters.
The effect of local sort on parallel sorting algorithms
TLDR
There are three important contributions in SCS-Radix sort: first, the work saved by detecting data skew dynamically; second, the exploitation of the memory hierarchy done by the algorithm; and third, the execution time stability of SCS-Radix when sorting data sets with different characteristics.
Sequential in-core sorting performance for a SQL data service and for parallel sorting on heterogeneous clusters
TLDR
It is shown, through fine-grained experiments on an Athlon processor, that L1 data cache misses are not the central problem; rather, a suitable proportion of independent retired instructions is needed to get good performance for in-core sorting.
Super Scalar Sample Sort
TLDR
The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching using predicated instructions, which facilitates optimizations like loop unrolling and software pipelining.
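To make the idea in this summary concrete, the sketch below classifies an element into a bucket by walking an implicit binary tree of splitters, turning each comparison result into an index offset instead of a branch. It is a hedged illustration: the names `build_tree` and `classify` are assumptions, the number of buckets is assumed to be a power of two, and Python itself does not emit predicated instructions, so only the data flow is shown.

```python
def build_tree(splitters):
    """Lay out sorted splitters as an implicit binary tree (root at index 1).

    Assumes len(splitters) + 1 is a power of two. Hypothetical helper.
    """
    k = len(splitters) + 1          # number of buckets
    tree = [None] * k               # index 0 is unused

    def fill(node, lo, hi):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        tree[node] = splitters[mid]
        fill(2 * node, lo, mid)
        fill(2 * node + 1, mid + 1, hi)

    fill(1, 0, len(splitters))
    return tree


def classify(tree, x):
    """Return the bucket index of x with no if/else on the comparison."""
    k = len(tree)
    levels = k.bit_length() - 1     # log2(k) tree levels
    j = 1
    for _ in range(levels):
        # The comparison result (False/True, i.e. 0/1) selects the child.
        j = 2 * j + (x > tree[j])
    return j - k
```

With splitters `[10, 20, 30]`, bucket `i` receives the elements greater than splitter `i - 1` and at most splitter `i`. Because the loop has a fixed trip count and no data-dependent branch, it is the kind of code that can be unrolled and software-pipelined in a compiled language.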
CC-Radix: a cache conscious sorting based on Radix sort
TLDR
CC-Radix improves the data locality by dynamically partitioning the data set into subsets that fit in cache level L2.
Block oriented processing of relational database operations in modern computer architectures
TLDR
It is argued that a block-oriented processing strategy for database operations can lead to better utilization of the processors and caches, generating significantly higher performance.
High-performance sorting on networks of workstations
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale
Communication conscious radix sort
TLDR
A reorganization of Radix sort is proposed that leads to a highly local version of the algorithm at a very low cost and achieves a good load balance which makes it insensitive to skewed data distributions.
Sorting on the SGI Origin 2000: comparing MPI and shared memory implementations
TLDR
This paper analyses the C³-Radix (Communication- and Cache-Conscious Radix) sort algorithm, using the distributed and the shared memory parallel programming models, and explains the reasons for the different behaviours depending on the size of the data sets.

References

AlphaSort: a RISC machine sort
TLDR
A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads and proposes two new benchmarks: Minutesort: how much can you sort in a minute, and DollarSort: how to sort for a dollar.
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms
TLDR
The paper gives two examples that illustrate how the algorithms and architectural features interplay to produce high-performance codes, which are included in ESSL (Engineering and Scientific Subroutine Library); an overview of ESSL is also given.
Sorting Large Data Files on POOMA
TLDR
The results show that the benchmark is able to exploit the full capabilities of the computing power, the storage devices and the communication bandwidth and the applicability of the POOMA platform for this application, even where the POOL implementation was, at the time of the experiment, far from optimal.
Parallel sorting on a shared-nothing architecture using probabilistic splitting
  • D. DeWitt, J. Naughton, D. Schneider
  • Computer Science
    [1991] Proceedings of the First International Conference on Parallel and Distributed Information Systems
  • 1991
TLDR
The authors consider the problem of external sorting in a shared-nothing multiprocessor with two techniques for determining ranges of sort keys: exact splitting, using a parallel version of the algorithm proposed by Iyer, Ricard, and Varman; and probabilistic splitting, which uses sampling to estimate quantiles.
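The probabilistic splitting mentioned in this summary can be sketched as follows: draw a random sample of the keys, sort the sample, and take evenly spaced sample quantiles as range boundaries. This is a minimal single-process illustration under stated assumptions, not the authors' implementation. The names `choose_splitters` and `partition`, the `sample_size` parameter, and the use of `bisect` for routing are all choices made for this example, and `keys` is assumed to be a non-empty list.

```python
import bisect
import random


def choose_splitters(keys, parts, sample_size, rng):
    """Estimate parts-1 range boundaries from a random sample of keys."""
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced sample quantiles approximate the true key quantiles.
    return [sample[(i * len(sample)) // parts] for i in range(1, parts)]


def partition(keys, splitters):
    """Route each key to the range it belongs to (one list per range)."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for k in keys:
        buckets[bisect.bisect_left(splitters, k)].append(k)
    return buckets


# Usage: split shuffled keys into 4 ranges, as if for 4 nodes.
rng = random.Random(0)
keys = list(range(1000))
rng.shuffle(keys)
splitters = choose_splitters(keys, parts=4, sample_size=100, rng=rng)
buckets = partition(keys, splitters)
```

Sorting each bucket locally and concatenating the buckets in order yields a fully sorted sequence; how evenly the buckets are sized depends on how well the sample reflects the key distribution.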
Characterization of alpha AXP performance using TP and SPEC workloads
TLDR
A simple model for evaluating the effects of various design tradeoffs based on the data collected by using hardware monitors is proposed and indicates that Alpha AXP takes advantage of lower cycles per instruction and cycle time to achieve a significant performance advantage.
High-Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2
TLDR
This paper describes the parallel implementation of the five kernel benchmarks from this suite on the IBM SP2™, a scalable, distributed memory parallel computer.
The Nas Parallel Benchmarks
TLDR
A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers that mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.
A measure of transaction processing power
TLDR
These benchmarks measure the performance of diverse transaction processing systems and a standard system cost measure is stated and used to define price/performance metrics.
Parallel Sorting Methods for Large Data Volumes on a Hypercube Database Computer
TLDR
Two external sorting algorithms for hypercube database computers are presented based on partitioning of data according to partition values obtained through sampling of the data.
The benchmark handbook for database and transaction processing systems
Transaction Processing Performance Council (TPC) is a non-profit to define transaction processing and database benchmarks and to disseminate TPC benchmarks are used in evaluating the performance of