Sorting is a kernel algorithm for a wide range of applications. We present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sort on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. Firstly, …
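The two-phase structure named in this abstract (a bitonic sort followed by a merge sort) can be illustrated with a small sequential sketch. This is not the GPU-Warpsort implementation: the block size, the padding strategy, and the function names `bitonic_sort`, `merge`, and `block_sort_then_merge` are assumptions made for illustration, and nothing here models warps or GPU memory.

```python
def bitonic_sort(block):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(block)
    n = len(a)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def merge(left, right):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def block_sort_then_merge(data, block_size=8):
    """Phase 1: bitonic-sort fixed-size blocks. Phase 2: merge the runs pairwise."""
    if not data:
        return []
    pad = max(data)  # assumed padding value for a short final block
    runs = []
    for start in range(0, len(data), block_size):
        chunk = list(data[start:start + block_size])
        real = len(chunk)
        chunk += [pad] * (block_size - real)
        runs.append(bitonic_sort(chunk)[:real])
    while len(runs) > 1:
        merged = [merge(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])
        runs = merged
    return runs[0]
```

The bitonic network performs the same compare-exchange pattern regardless of the input values, which is why it suits fixed-size blocks; the merge phase then combines the sorted runs into one sequence.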
Moore’s law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data …
Moore’s Law suggests that the number of processing cores on a single chip will increase exponentially. Future performance gains will come mainly from thread-level parallelism exploited by multi-/many-core processors (MCPs). It is therefore necessary to find out how to build MCP hardware and how to program the parallelism on such MCPs. In this …
To date, most many-core prototypes employ tiled topologies connected through on-chip networks. The throughput and latency of the on-chip network usually become the bottleneck to achieving peak performance, especially for communication-intensive applications. Most studies focus on the on-chip network only, such as routing algorithms or router …
Synchronization between threads has a serious impact on the performance of many-core architectures. When communication is frequent, coarse-grained synchronization brings significant overhead and is therefore not suitable. However, the overhead of fine-grained synchronization is still small when the communication …
Computer architectures are making a dramatic turn away from improving single-processor performance towards improving parallel performance by integrating many cores on one chip. However, providing directory-based coherence protocols for these platforms is too complex and expensive. As a substitute, we propose a synchronization-based cache coherence solution, …
Location consistency (LC) is a weak memory consistency model that is defined entirely in terms of the partial-order execution semantics of parallel programs. Compared with sequential consistency (SC), LC is scalable and provides ample theoretical parallelism. This makes LC an interesting memory model for the upcoming many-core parallel processing era. Previous work has …
Conflicts can severely degrade computer performance: bank conflicts reduce the bandwidth of interleaved multibank memory systems, and conflict misses reduce the effective on-chip capacity, which in turn causes further conflict misses. Conflicts can be avoided by a suitable address mapping scheme which maps the most frequently occurring patterns …
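To make the bank-conflict problem concrete, the sketch below counts how many distinct banks a strided access pattern touches under two address-to-bank mappings. The abstract is truncated before it describes its own scheme, so the XOR-based mapping here is a common textbook technique swapped in purely for illustration, not the mapping proposed in the paper. The bank count of 8 and the function names are likewise assumptions.

```python
def bank_modulo(addr, num_banks=8):
    """Plain low-order interleaving: the bank is the address modulo the bank count."""
    return addr % num_banks


def bank_xor(addr, num_banks=8):
    """Illustrative XOR mapping: fold the next address bits into the bank index.

    Assumes num_banks is a power of two.
    """
    bits = num_banks.bit_length() - 1
    return (addr ^ (addr >> bits)) & (num_banks - 1)


def banks_touched(mapping, stride, count=8, num_banks=8):
    """Number of distinct banks hit by `count` accesses spaced `stride` apart."""
    return len({mapping(i * stride, num_banks) for i in range(count)})


if __name__ == "__main__":
    for stride in (1, 8):
        print(stride,
              banks_touched(bank_modulo, stride),
              banks_touched(bank_xor, stride))
```

With a stride equal to the number of banks, the modulo mapping sends every access to the same bank, while the XOR mapping spreads the same accesses across all eight. Unit-stride accesses use all banks under both mappings.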