Corpus ID: 8326819

TritonSort: A Balanced Large-Scale Sorting System

@inproceedings{Rasmussen2011TritonSortAB,
  title={TritonSort: A Balanced Large-Scale Sorting System},
  author={Alexander Rasmussen and George Porter and Michael Conley and Harsha V. Madhyastha and Radhika Niranjan Mysore and Alexander Pucher and Amin Vahdat},
  booktitle={NSDI},
  year={2011}
}
We present TritonSort, a highly efficient, scalable sorting system. It is designed to process large datasets, and has been evaluated against as much as 100 TB of input data spread across 832 disks in 52 nodes at a rate of 0.916 TB/min. When evaluated against the annual Indy GraySort sorting benchmark, TritonSort is 60% better in absolute performance and has over six times the per-node efficiency of the previous record holder. In this paper, we describe the hardware and software architecture… 
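The partition-then-sort strategy at the heart of systems like TritonSort can be illustrated with a minimal single-process sketch; the partition count and key function below are illustrative assumptions, not the paper's implementation.

```python
# Minimal single-process sketch of a two-phase partition-then-sort,
# the general strategy balanced sorting systems apply across disks and
# nodes. Partition count and key function are illustrative assumptions.

def two_phase_sort(records, num_partitions=4, key=lambda r: r):
    # Phase 1: route each record to a partition by key range, so every
    # partition holds a disjoint, ordered slice of the key space.
    lo, hi = min(map(key, records)), max(map(key, records))
    span = (hi - lo) + 1
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        idx = min((key(r) - lo) * num_partitions // span, num_partitions - 1)
        partitions[idx].append(r)
    # Phase 2: sort each partition independently; concatenating the
    # sorted partitions yields a globally sorted result.
    out = []
    for p in partitions:
        out.extend(sorted(p, key=key))
    return out

print(two_phase_sort([9, 3, 7, 1, 8, 2]))  # [1, 2, 3, 7, 8, 9]
```

Because each partition covers a disjoint key range, phase two needs no cross-partition communication, which is what lets each disk or node work independently.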
TritonSort: A Balanced and Energy-Efficient Large-Scale Sorting System
TLDR
This article describes the hardware and software architecture necessary to operate TritonSort, a highly efficient, scalable sorting system designed to process large datasets, which sorts data at approximately 80% of the disks' aggregate sequential write speed.
Algorithms for high-throughput disk-to-disk sorting
TLDR
A new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers, is presented, able to almost completely hide the computation (sorting) behind the IO latency.
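The idea of hiding the sorting computation behind I/O latency can be sketched with simple double buffering; the chunked source and in-memory sink below are hypothetical stand-ins for disk reads and writes.

```python
# Sketch of hiding computation (sorting) behind I/O via double buffering:
# while one chunk is being sorted, the next chunk is read concurrently.
# The chunk list and sink are hypothetical stand-ins for disk I/O.
import threading, queue

def pipeline_sort(chunks, sink):
    q = queue.Queue(maxsize=1)  # one chunk in flight: double buffering

    def reader():
        for c in chunks:
            q.put(c)       # simulated disk read
        q.put(None)        # end-of-stream sentinel

    t = threading.Thread(target=reader)
    t.start()
    while (c := q.get()) is not None:
        sink.append(sorted(c))  # sorting overlaps with the next read
    t.join()

out = []
pipeline_sort([[3, 1], [9, 4], [2, 8]], out)
print(out)  # [[1, 3], [4, 9], [2, 8]]
```

When sorting a chunk takes no longer than reading the next one, the computation is fully hidden and throughput is bounded by I/O alone.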
Efficient disk-to-disk sorting: a case study in the decoupled execution paradigm
TLDR
An optimized algorithm is proposed that uses almost all features of DEP, pushing the performance of sorting in HPC even further than other existing solutions and achieving 30% better performance than the theoretically optimal sorting algorithm running on the same testbed but not designed to exploit the DEP architecture.
Riffle: optimized shuffle service for large-scale data analytics
TLDR
Riffle is presented, an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data.
Flat Datacenter Storage
TLDR
The FDS-based sort application that set the 2012 world record for disk-to-disk sorting is described, along with single-process read and write performance of more than 2 GB/s.
HykSort: a new variant of hypercube quicksort on distributed memory architectures
TLDR
HykSort is an optimized comparison sort for distributed memory architectures that attains more than a 2× improvement over bitonic sort and samplesort; a staged communication samplesort is also presented, which is more robust than the original samplesort for large core counts.
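A minimal sketch of the samplesort family that HykSort builds on: sample keys, choose splitters, bucket by splitter, then sort each bucket. The sample size and bucket count here are illustrative assumptions, not the paper's distributed implementation.

```python
# Samplesort sketch: splitters drawn from a random sample partition the
# keys into ordered buckets, which are then sorted independently.
# Sample size and bucket count are illustrative assumptions.
import bisect, random

def samplesort(data, buckets=4, oversample=8):
    rng = random.Random(0)
    sample = sorted(rng.sample(data, min(len(data), buckets * oversample)))
    # Choose evenly spaced splitters from the sorted sample.
    splitters = [sample[i * len(sample) // buckets] for i in range(1, buckets)]
    parts = [[] for _ in range(buckets)]
    for x in data:
        parts[bisect.bisect_right(splitters, x)].append(x)
    # Buckets cover disjoint, ordered key ranges, so concatenation
    # of the sorted buckets is globally sorted.
    return [y for p in parts for y in sorted(p)]

data = list(range(100, 0, -1))
print(samplesort(data) == sorted(data))  # True
```

Oversampling makes the splitters approximate the true quantiles, which keeps bucket sizes balanced across processors in the distributed setting.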
CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster
TLDR
This paper presents CloudRAMSort, a fast and efficient system for large-scale distributed sorting on shared-nothing clusters that maximizes per-node efficiency by exploiting modern architectural features such as multiple cores and SIMD units; it also provides a detailed analytical model that accurately projects the performance of CloudRAMSort with varying tuple sizes and interconnect bandwidths.
MilliSort and MilliQuery: Large-Scale Data-Intensive Computing in Milliseconds
TLDR
This paper explores the possibility of flash bursts: applications that use a large number of servers but for very short time intervals (as little as one millisecond) and developed two new benchmarks, MilliSort and MilliQuery.
FANS: FPGA-Accelerated Near-Storage Sorting
TLDR
FANS is proposed, an FPGA accelerated near-storage sorting system which selects the optimized design configuration and achieves the theoretically maximum end-to-end performance when using a single Samsung SmartSSD device.
Exoshuffle: Large-Scale Shuffle at the Application Level
TLDR
This work argues that the inflexibility stems from the tight coupling of shuffle algorithms and system-level optimizations, and proposes to use the distributed futures abstraction to decouple fine-grained pipelining from the system.
...

References

Showing 1–10 of 24 references
High-performance sorting on networks of workstations
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive with sorting on the large-scale…
AlphaSort: a cache-sensitive parallel external sort
TLDR
A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads and argues that modern architectures require algorithm designers to re-examine their use of the memory hierarchy.
The input/output complexity of sorting and related problems
TLDR
Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
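The sorting bound from this model is commonly stated as follows, for N items, internal memory of size M, and block transfer size B:

```latex
% External-memory sorting bound (Aggarwal-Vitter I/O model):
% N items, internal memory M, block transfer size B.
\mathrm{Sort}(N) = \Theta\!\left( \frac{N}{B} \log_{M/B} \frac{N}{B} \right)
```

The base of the logarithm, M/B, is the merge fan-in an algorithm can sustain per pass, which is why large memories make very few passes suffice.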
The architectural costs of streaming I/O: A comparison of workstations, clusters, and SMPs
TLDR
It is found that the architectures studied are not well balanced for streaming I/O applications, and that the clustered workstations provide higher absolute performance for streaming I/O workloads.
Scalable distributed-memory external sorting
TLDR
An algorithm whose I/O requirement is close to a lower bound is outlined; in contrast to naive implementations of multiway merging and all other known approaches, it works with just two passes over the data even for the largest conceivable inputs.
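A two-pass structure of this kind (run formation, then a single multiway merge) can be sketched in miniature; the in-memory "runs" below are hypothetical stand-ins for on-disk files, and the memory limit is an illustrative assumption.

```python
# Two-pass external sort sketch: pass 1 forms sorted runs that fit in
# "memory"; pass 2 merges all runs in a single multiway merge.
# Runs are kept as lists here as stand-ins for on-disk files.
import heapq

def external_sort(stream, mem_limit=3):
    runs, buf = [], []
    for x in stream:                 # pass 1: run formation
        buf.append(x)
        if len(buf) == mem_limit:
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))
    return list(heapq.merge(*runs))  # pass 2: single multiway merge

print(external_sort([5, 2, 9, 1, 7, 3, 8]))  # [1, 2, 3, 5, 7, 8, 9]
```

Two passes suffice as long as the number of runs stays within the merge fan-in that memory allows; beyond that, additional merge passes would be needed.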
Sorting on a Cluster Attached to a Storage-Area Network
In November 2004, the SAN Cluster Sort program (SCS) set new records for the Indy versions of the Minute and TeraByte Sorts. SCS ran on a cluster of 40 dual-processor Itanium2 nodes on the show floor…
Nsort: a Parallel Sorting Program for NUMA and SMP Machines
TLDR
Ordinal's Nsort is a high-performance sort program for SGI IRIX, Sun Solaris, and HP-UX servers that can use tens of processors and hundreds of disks to quickly sort and merge data.
Dryad: distributed data-parallel programs from sequential building blocks
TLDR
The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Efficiency matters!
TLDR
There is a pressing need to rethink the design of future data-intensive computing systems, which have focused on scalability without considering efficiency, and to reconsider the direction of future research.
The Gamma Database Machine Project
TLDR
The design of the Gamma database machine and the techniques employed in its implementation are described and a thorough performance evaluation of the iPSC/s hypercube version of Gamma is presented.
...