TritonSort: A Balanced and Energy-Efficient Large-Scale Sorting System

Authors: Alexander Rasmussen, George Porter, Michael Conley, Harsha V. Madhyastha, Radhika Niranjan Mysore, Alexander Pucher, Amin Vahdat
Journal: ACM Trans. Comput. Syst.
We present TritonSort, a highly efficient, scalable sorting system. It is designed to process large datasets, and has been evaluated against as much as 100TB of input data spread across 832 disks in 52 nodes at a rate of 0.938TB/min. When evaluated against the annual Indy GraySort sorting benchmark, TritonSort is 66% better in absolute performance and has over six times the per-node throughput of the previous record holder. When evaluated against the 100TB Indy JouleSort benchmark, TritonSort… 
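The headline figures in the abstract can be turned into per-node and per-disk rates with a little arithmetic. The sketch below is only a back-of-the-envelope calculation from the numbers quoted above; the function name `derived_rates` and the choice of decimal units (1 TB = 1,000,000 MB) are assumptions made for illustration, not something the paper states here.

```python
# Figures quoted in the abstract above.
INPUT_TB = 100.0          # input size in TB
RATE_TB_PER_MIN = 0.938   # reported sort rate
NODES = 52
DISKS = 832

# Assumption: decimal units (1 TB = 1,000,000 MB).
MB_PER_TB = 1_000_000


def derived_rates():
    """Derive elapsed time and per-node / per-disk sort rates (hypothetical helper)."""
    total_minutes = INPUT_TB / RATE_TB_PER_MIN
    cluster_mb_per_s = RATE_TB_PER_MIN * MB_PER_TB / 60
    return {
        "total_minutes": total_minutes,
        "disks_per_node": DISKS // NODES,
        "per_node_mb_per_s": cluster_mb_per_s / NODES,
        "per_disk_mb_per_s": cluster_mb_per_s / DISKS,
    }


if __name__ == "__main__":
    for name, value in derived_rates().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the run takes roughly 107 minutes, with 16 disks per node, about 300 MB/s of sorted data per node, and about 19 MB/s per disk. These are rates of sorted output, not raw device bandwidth: each disk moves more bytes than this, because a sort has to read its input as well as write its output.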
SDS-Sort: Scalable Dynamic Skew-aware Parallel Sorting
A new scalable dynamic skew-aware parallel sorting algorithm, named SDS-Sort, which uses a skew-aware partition method to guarantee a tighter upper bound on the workload of each process and provides optimizations, including adaptive local merging, overlapping of data exchange and data processing, and dynamic selection of data processing algorithms.
A hybrid design for high performance large-scale sorting on FPGA
This work proposes a merge sort based hybrid design where the final few stages in the merge sort network are replaced with “folded” bitonic merge networks, and presents a theoretical analysis to quantify latency, memory and throughput of the proposed design.
RTHS: A Low-Cost High-Performance Real-Time Hardware Sorter, Using a Multidimensional Sorting Algorithm
Implementing the RTHS design on a Virtex-7 field-programmable gate array (FPGA) reveals that the number of lookup tables (LUTs) of the proposed method has decreased compared to the conventional Bitonic sorting network (CBSN) and the state-of-the-art PHSA, respectively.
Energy and Memory Efficient Mapping of Bitonic Sorting on FPGA
This paper proposes a streaming permutation network (SPN) by "folding" the classic Clos network and proves that the SPN is programmable to realize all the interconnection patterns in the bitonic sorting network.
I/O chunking and latency hiding approach for out-of-core sorting acceleration using GPU and flash NVM
Results indicate that I/O chunking and latency hiding/overlapping maintains sorting performance, despite slow Flash NVM performance, by utilizing GPUs along with good algorithms.
Computer Generation of High Throughput and Memory Efficient Sorting Designs on FPGA
A hardware generator is developed to automatically build bitonic sorting architectures on FPGAs for given input size, data width and data parallelism and achieves optimal memory efficiency and outperforms the state-of-the-art.
An Efficient Sorting Architecture for Area and Energy Constrained Edge Computing Devices
A new sorting architecture that reduces the number of hardware resources and energy consumption compared to the state-of-the-art sorting architecture and achieves the desired performance using Unary processing is presented.
A New Hardware Accelerator for Data Sorting in Area & Energy Constrained Architectures
A new sorting architecture is presented that reduces the number of required resources compared to the state-of-the-art sorting architecture and achieves the desired performance using Unary processing.
Balancing CPU and Network in the Cell Distributed B-Tree Store
It is observed that combining server-side and client-side processing allows systems to balance and adapt to the available CPU and network resources with minimal configuration, and can free resources for other CPU-intensive work.
Faster: A Low Overhead Framework for Massive Data Analysis
Faster, a low latency distributed processing framework, designed to explore data locality to reduce processing costs in such algorithms, is introduced, which can significantly outperform Spark on large graphs.


TritonSort: A Balanced Large-Scale Sorting System
We present TritonSort, a highly efficient, scalable sorting system. It is designed to process large datasets, and has been evaluated against as much as 100 TB of input data spread across 832 disks in
Alphasort: A cache-sensitive parallel external sort
A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads and argues that modern architectures require algorithm designers to re-examine their use of the memory hierarchy.
High-performance sorting on networks of workstations
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale
The architectural costs of streaming I/O: A comparison of workstations, clusters, and SMPs
It is found that the architectures studied are not well balanced for streaming I/O applications, and the clustered workstations provide higher absolute performance for streaming I/O workloads.
The input/output complexity of sorting and related problems
Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
Scalable distributed-memory external sorting
An algorithm whose I/O requirement is close to a lower bound is outlined. In contrast to naive implementations of multiway merging and all other approaches known to us, the algorithm works with just two passes over the data even for the largest conceivable inputs.
Flux: an adaptive partitioning operator for continuous query systems
A dataflow operator called flux is introduced that encapsulates adaptive state partitioning and dataflow routing that can be used for CQ operators under shifting processing and memory loads and can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case.
Sorting on a Cluster Attached to a Storage-Area Network
In November 2004, the SAN Cluster Sort program (SCS) set new records for the Indy versions of the Minute and TeraByte Sorts. SCS ran on a cluster of 40 dual-processor Itanium2 nodes on the show floor
Dryad: distributed data-parallel programs from sequential building blocks
The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Nsort: a Parallel Sorting Program for NUMA and SMP Machines
Ordinal™ Nsort™ is a high-performance sort program for SGI IRIX, Sun Solaris and HP-UX servers that can use tens of processors and hundreds of disks to quickly sort and merge data.