With the microprocessor industry's shift to multicore architectures, research in parallel computing is essential to ensure future progress in mainstream computer systems. This in turn requires standard benchmark programs to compare platforms, identify performance bottlenecks, and evaluate potential solutions. Several current benchmark suites provide…
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses…
SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) is stalled due to long-latency memory accesses. The resulting idle cycles are extremely…
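A minimal CUDA sketch of the stall this abstract describes (the kernel and its names are illustrative, not taken from the paper): each lane chases a pointer chain, so every iteration issues a dependent long-latency load, and because a warp's lanes issue in lockstep, the slowest miss at each step idles all of them.

    __global__ void chase(const int *next, int *last, int steps, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int node = i;
        for (int s = 0; s < steps; ++s)
            node = next[node];   // dependent load: the whole warp idles on a miss here
        last[i] = node;          // keep the result so the loop is not optimized away
    }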
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the…
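A minimal CUDA sketch of one tiled ISL step may make the terms concrete (a hypothetical 1-D 3-point stencil; identifiers are illustrative, not from the paper). Each block stages its tile plus one halo cell per side in shared memory; __syncthreads() is the intra-tile synchronization, and the halo cells are what neighboring PEs must exchange between iterations.

    #define TILE 256

    __global__ void stencil_step(const float *in, float *out, int n)
    {
        __shared__ float s[TILE + 2];                 // tile plus two halo cells
        int g = blockIdx.x * TILE + threadIdx.x;      // global index (assumes n % TILE == 0)
        int l = threadIdx.x + 1;                      // local index, offset past the left halo

        s[l] = in[g];
        if (threadIdx.x == 0)        s[0]        = (g > 0)     ? in[g - 1] : 0.0f;
        if (threadIdx.x == TILE - 1) s[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
        __syncthreads();                              // synchronize before halo cells are read

        out[g] = 0.25f * s[l - 1] + 0.5f * s[l] + 0.25f * s[l + 1];
    }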
We propose GROPHECY, a GPU performance projection framework that can estimate the performance benefit of GPU acceleration without actual GPU programming or hardware. Users need only skeletonize the pieces of CPU code that are targets for GPU acceleration. Code skeletons are automatically transformed in various ways to mimic tuned GPU codes with…
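The excerpt does not show GROPHECY's actual skeleton format, so the following C-style sketch is purely illustrative of the idea: a skeleton keeps the loop structure and memory access pattern of a CPU hot spot while abstracting away the arithmetic whose cost the framework models.

    double expensive_model(double, double, double);   /* the real physics (opaque here) */
    #define WORK3(x, y, z) ((x) + (y) + (z))          /* hypothetical placeholder with matching accesses */

    /* Original CPU hot spot (hypothetical). */
    void hotspot(const double *a, double *b, int n)
    {
        for (int i = 1; i < n - 1; i++)
            b[i] = expensive_model(a[i - 1], a[i], a[i + 1]);
    }

    /* Skeleton: loop bounds and memory accesses preserved, computation replaced. */
    void hotspot_skeleton(const double *a, double *b, int n)
    {
        for (int i = 1; i < n - 1; i++)
            b[i] = WORK3(a[i - 1], a[i], a[i + 1]);
    }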
Without high-bandwidth broadcast, large numbers of cores require a scalable point-to-point interconnect and a directory protocol. In such cases, a shared, inclusive last level cache (LLC) can improve data sharing and avoid three-way communication for shared reads. However, if inclusion encompasses thread-private data, two problems arise with the shared LLC.
Manycore processors with wide SIMD cores are becoming a popular choice for the next generation of throughput-oriented architectures. We introduce a hardware technique called "diverge on miss" that allows SIMD cores to better tolerate memory latency for workloads with non-contiguous memory access patterns. Individual threads within a SIMD "warp" are allowed…
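As a concrete example of such a workload (hypothetical CUDA code, not from the paper), consider a sparse gather whose data-dependent indices scatter a warp's 32 loads across unrelated cache lines; on a conventional SIMD core the whole warp waits for the slowest miss, whereas diverge-on-miss would let the lanes that hit in cache continue.

    __global__ void sparse_gather(const float *table, const int *idx,
                                  float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = table[idx[i]];   // non-contiguous, data-dependent accesses
    }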
Graphics processors (GPUs), with many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. To best utilize the GPU’s parallel computing resources, it is crucial to understand how GPU architectures and programming models can be applied to different categories of traditionally CPU…
Accelerators such as graphics processors (GPUs) have become increasingly popular for high performance scientific computing. Often, much effort is invested in creating and optimizing GPU code without any guaranteed performance benefit. To reduce this risk, performance models can be used to project a kernel's GPU performance potential before it is ported.
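As a sketch of the simplest projection such models refine (illustrative only; the two-term bound and the numbers below are assumptions, not the paper's model), a kernel's best-case GPU time can be bounded by its compute and memory demands against a candidate GPU's peak rates, before any code is ported.

    #include <algorithm>
    #include <cstdio>

    // Lower bound on kernel time: it can finish no faster than either the
    // compute limit or the memory-bandwidth limit allows.
    static double projected_seconds(double flops, double bytes,
                                    double peak_flops, double peak_bw)
    {
        return std::max(flops / peak_flops, bytes / peak_bw);
    }

    int main()
    {
        // Hypothetical kernel: 2 GFLOP of work and 12 GB of traffic, projected
        // on a GPU with 1 TFLOP/s peak compute and 150 GB/s peak bandwidth.
        std::printf("best-case time: %.4f s\n",
                    projected_seconds(2e9, 12e9, 1e12, 150e9));
        return 0;
    }

Here the memory term dominates (0.08 s versus 0.002 s), so this hypothetical kernel would be bandwidth-bound and further compute tuning could not beat that floor.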