The client computing platform is moving towards a heterogeneous architecture that combines cores focused on scalar performance with a set of throughput-oriented cores. The throughput-oriented cores (e.g. a GPU) may be connected over both coherent and non-coherent interconnects, and may have different ISAs. This paper describes a programming…
Hardware trends suggest that large-scale CMP architectures, with tens to hundreds of processing cores on a single piece of silicon, are imminent within the next decade. While existing CMP machines have traditionally been handled in the same way as SMPs, this magnitude of parallelism introduces several fundamental challenges at the architectural level and…
Computation reuse and value prediction are two recent techniques for improving microprocessor performance by exploiting value localities. They both aim at breaking the data dependence limit in traditional processors. In this paper, we propose a speculative multithreading scheme in which the same hardware can be efficiently used for both computation reuse…
The performance of single-threaded programs and legacy binary code is of critical importance in many everyday applications. However, hardware multi-core processors cannot directly speed up single-threaded programs, nor can automatic parallelizing compilers effectively parallelize legacy binary code and irregular applications. In this paper, we…
Advanced microprocessors have been increasing clock rates well beyond the gigahertz boundary. For such high-performance microprocessors, a small and fast data micro cache (ucache) is important to overall performance, and proper management of it via load bypassing has a significant performance impact. In this paper, we propose and evaluate a…
64-bit processor architectures like the Intel® Itanium® Processor Family are designed for large applications that need large memory addresses. When running applications that fit within a 32-bit address space, 64-bit CPUs are at a disadvantage compared to 32-bit CPUs because of the larger memory footprints for their data. This results in worse cache and…
Dynamic binary translators use a two-phase approach to identify and optimize frequently executed code dynamically. In the first phase (profiling phase), blocks of code are interpreted or quickly translated to collect execution frequency information for the blocks. In the second phase (optimization phase), frequently executed blocks are grouped into regions and…
Compiler-directed data speculation has been implemented on Itanium systems to allow a compiler to move a load across a store even when the two operations are potentially aliased. This not only breaks data dependency to reduce critical path length, but also allows a load to be scheduled far apart from its uses to hide cache miss latencies. However, the…