We present a hybrid CUDA-MPI sorting algorithm that uses GPU clusters to sort large data sets. Our algorithm has two phases. In the first phase, each node sorts a portion of the data on its GPU using a parallel bitonic sort. In the second phase, the sorted subsequences are merged together in parallel using a reduction sorting network implemented in …
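The two-phase structure described in this abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: there is no CUDA or MPI here, the "nodes" are simulated as list slices, the bitonic network is run sequentially, and the second phase uses a plain pairwise tree merge as a stand-in for the reduction sorting network, whose details are cut off in the abstract. The names `bitonic_sort`, `merge_pairwise`, `hybrid_sort`, and `num_nodes` are hypothetical.

```python
import heapq
from typing import List


def bitonic_sort(values: List[float]) -> List[float]:
    """Sequentially simulate an ascending bitonic sorting network."""
    n = len(values)
    # Bitonic networks need a power-of-two size, so pad with +inf.
    size = 1
    while size < n:
        size *= 2
    a = list(values) + [float("inf")] * (size - n)

    k = 2
    while k <= size:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:         # distance between compared elements
            for i in range(size):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a[:n]              # the padding ends up at the tail


def merge_pairwise(chunks: List[List[float]]) -> List[float]:
    """Merge sorted chunks two at a time, like a tree-shaped reduction.

    Assumption: stands in for the paper's reduction sorting network.
    """
    runs = [list(c) for c in chunks]
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(list(heapq.merge(runs[i], runs[i + 1])))
            else:
                merged.append(runs[i])  # odd chunk passes through
        runs = merged
    return runs[0] if runs else []


def hybrid_sort(data: List[float], num_nodes: int) -> List[float]:
    """Phase 1: sort one slice per simulated node. Phase 2: merge."""
    chunk = max(1, -(-len(data) // num_nodes))  # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    sorted_parts = [bitonic_sort(p) for p in parts]
    return merge_pairwise(sorted_parts)


if __name__ == "__main__":
    print(hybrid_sort([5, 3, 8, 1, 9, 2, 7], num_nodes=3))
```

In a real cluster setting each slice would live on a different node and the merge rounds would involve communication between nodes; the sketch only shows the data flow between the two phases.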
This paper presents an evaluation and comparison of three topologies that are popular for building interconnection networks in large-scale supercomputers: torus, fat-tree, and dragonfly. To perform this evaluation, we propose a comprehensive methodology and present a scalable packet-level network simulator, TraceR. Our methodology includes design of …
The purpose of this project is to develop a system that can auto-tune the sorting of a given set of data. Auto-tuning involves empirically searching through a set of parameters such that a given set of data can be sorted as fast as possible. The parameters involved in this empirical search include the type of sorting algorithm used, the type of hardware …
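The empirical search described in this abstract can be sketched as a small timing loop. This is only an illustration under stated assumptions: it searches over a single parameter (the choice of sorting algorithm), ignores the hardware dimension and any other parameters the truncated abstract may list, and the names `autotune`, `insertion_sort`, `candidates`, and `repeats` are hypothetical rather than taken from the project.

```python
import time
from typing import Callable, Dict, List, Tuple


def insertion_sort(values: List[float]) -> List[float]:
    """Simple O(n^2) sort, included only as a second candidate."""
    out = list(values)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def autotune(
    candidates: Dict[str, Callable[[List[float]], List[float]]],
    sample: List[float],
    repeats: int = 3,
) -> Tuple[str, Dict[str, float]]:
    """Time every candidate on the sample and return the fastest one.

    Returns the winning name and the best observed time per candidate.
    """
    expected = sorted(sample)
    timings: Dict[str, float] = {}
    for name, sort_fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            result = sort_fn(sample)
            elapsed = time.perf_counter() - start
            best = min(best, elapsed)
        # Reject candidates that are fast but wrong.
        if list(result) != expected:
            raise ValueError(f"candidate {name!r} did not sort the sample")
        timings[name] = best
    winner = min(timings, key=timings.get)
    return winner, timings


if __name__ == "__main__":
    data = [7.0, 2.0, 9.0, 4.0, 1.0, 8.0, 3.0]
    choice, measured = autotune(
        {"builtin": sorted, "insertion": insertion_sort}, data
    )
    print("fastest candidate:", choice)
```

Taking the minimum over several repeats reduces the effect of timing noise; which candidate wins depends on the input size, the data distribution, and the machine, which is the reason the search is done empirically.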