A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs. The <i>Tagless</i> …
Current on-chip block-centric memory hierarchies exploit access patterns at the fine-grain scale of small blocks. Several recently proposed techniques for coherence traffic reduction and prefetching suggest that further useful patterns emerge with a macroscopic, coarse-grain view. To exploit coarse-grain behavior, previous work extended conventional caches …
Die-Stacked DRAM caches offer the promise of improved performance and reduced energy by capturing a larger fraction of an application's working set than on-die SRAM caches. However, given that their latency is only 50% lower than that of main memory, DRAM caches considerably increase latency for misses. They also incur a significant energy overhead for …
Virtualization has become a magic bullet to increase utilization, improve security, lower costs, and reduce management overheads. In many scenarios, the number of virtual machines consolidated onto a single processor has grown even faster than the number of hardware threads. This results in multiprogrammed virtualization where many virtual machines …
Modern smartphones comprise several processing and input/output units that communicate mostly through main memory. As a result, memory represents a critical performance bottleneck for smartphones. This work introduces a set of emerging workloads for smartphones and characterizes the performance of several memory controller policies and …
The replacement policies commonly used in modern processors perform an average of 57% worse than an optimal replacement policy for commercial applications using large, shared caches in a chip-multiprocessor (CMP). Recent proposals that improve the performance of smaller, uniprocessor caches with SPEC CPU workloads do not achieve similar benefits with …
Current on-chip block-centric memory hierarchies exploit access patterns at the fine-grain scale of small blocks. Several recently proposed memory hierarchy enhancements for coherence traffic reduction and prefetching suggest that additional useful patterns emerge with a macroscopic, coarse-grain view. This paper presents RegionTracker, a dual-grain, …
On-chip last-level caches are increasing to tens of megabytes to accommodate applications with large memory footprints and to compensate for high memory latencies and limited off-chip bandwidth. This paper reviews two ongoing research efforts that exploit such large caches: coarse-grain cache management and predictor virtualization. Coarse-grain cache …