Learn More
This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wave front Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wave front locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to(More)
K ey to the central promise inherent in Java technology— " write once, run anywhere " —is the fact that Java programs run on the Java virtual machine, insulating them from any contact with the underlying hardware. Consequently, Java programs must execute indirectly through a translation layer built into the Java virtual machine. This translator can take(More)
Die-stacking technology allows conventional DRAM to be integrated with processors. While numerous opportunities to make use of such stacked DRAM exist, one promising way is to use it as a large cache. Although previous studies show that DRAM caches can deliver performance benefits, there remain inefficiencies as well as significant hardware costs for(More)
GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can Increase contention for various system resources, however, that may result In suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache(More)
While scalable coherence has been extensively studied in the context of general purpose chip multiprocessors (CMPs), GPU architectures present a new set of challenges. Introducing conventional directory protocols adds unnecessary coherence traffic overhead to existing GPU applications. Moreover, these protocols increase the verification complexity of the(More)
The recent use of graphics processing units (GPUs) in several top supercomputers demonstrate their ability to consistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make them strong candidates for non-HPC applications as well. Server workloads are inherently parallel;(More)
In widely distributed systems generally, and in science-oriented Grids in particular, software, CPU time, storage, etc., are treated as " services " – they can be allocated and used with service guarantees that allows them to be integrated into systems that perform complex tasks. Network communication is currently not a service – it is provided, in general,(More)
This paper considers optimal control of a system of nonholonomic robots. Control effort and deviation from a desired formation are minimized for the system of robots traveling between specified initial and final configurations, with a bifurcation parameter which is the relative weighting assigned to the control effort versus the formation constraint. The(More)
Systems from smartphones to supercomputers are increasingly heterogeneous, being composed of both CPUs and GPUs. To maximize cost and energy efficiency, these systems will increasingly use globally-addressable heterogeneous memory systems, making choices about memory page placement critical to performance. In this work we show that current page placement(More)