Die-stacked DRAM caches offer the promise of improved performance and reduced energy by capturing a larger fraction of an application's working set than on-die SRAM caches. However, given that their latency is only 50% lower than that of main memory, DRAM caches considerably increase latency for misses. They also incur a significant energy overhead for…
Online transaction processing (OLTP) is at the core of many data center applications. OLTP workloads are known to have large instruction footprints that foil existing L1 instruction caches, resulting in poor overall performance. Prefetching can reduce the impact of such instruction cache miss stalls; however, state-of-the-art solutions require large…
Online transaction processing (OLTP) workload performance suffers from instruction stalls; the instruction footprint of a typical transaction far exceeds the capacity of an L1 cache, leading to ongoing cache thrashing. Several proposed techniques remove some instruction stalls in exchange for error-prone instrumentation of the code base, or a sharp…
Recent studies highlight that traditional transaction processing systems utilize the micro-architectural features of modern processors very poorly. L1 instruction cache and long-latency data misses dominate execution time. As a result, more than half of the execution cycles are wasted on memory stalls. Previous works on reducing stall time aim at improving…
Modern smartphones comprise several processing and input/output units that communicate mostly through main memory. As a result, memory represents a critical performance bottleneck for smartphones. This work introduces a set of emerging workloads for smartphones and characterizes the performance of several memory controller policies and…
During an instruction miss, a processor is unable to fetch instructions. The more frequent instruction misses are, the less able a modern processor is to find useful work to do, and thus performance suffers. Online transaction processing (OLTP) suffers from high instruction miss rates since the instruction footprint of OLTP transactions does not fit in today's…
This work revisits precomputation prefetching targeting long access latency loads with access patterns that are hard to predict. It presents Ekivolos, a precomputation prefetcher system that automatically builds prefetching slices that contain enough control flow instructions to faithfully and autonomously recreate the program's access behavior without…
Research Interests: Computer architecture, many-core architectures, on-chip networks, cache coherence protocols, interconnection networks, emerging applications for many-core architectures.