Learn More
The performance of memory-bound commercial applicationssuch as databases is limited by increasing memory latencies. Inthis paper, we show that exploiting memory-level parallelism(MLP) is an effective approach for improving the performance ofthese applications and that microarchitecture has a profound impacton achievable MLP. Using the epoch model of MLP, we(More)
The PICO-N system automatically synthesizes embedded nonprogrammable accelerators to be used as co-processors for functions expressed as loop nests in C. The output is synthesizable VHDL that defines the accelerator at the register transfer level (RTL). The system generates a synchronous array of customized VLIW (very-long instruction word) processors,(More)
Cache miss characterization models such as the three Cs model are useful in developing schemes to reduce cache misses and their penalty. In this paper we propose the OPT model that uses cache simulation under optimal (OPT) replacement to obtain a finer and more accurate characterization of misses than the three Cs model. However, current methods for optimal(More)
Data prefetching via helper threading has been extensively investigated on Simultaneous Multi- Threading (SMT) or Virtual Multi-Threading (VMT) architectures. Although reportedly large cache latency can be hidden by helper threads at runtime, most techniques rely on hardware support to reduce context switch overhead between the main thread and helper thread(More)
Automated design tools help to capture the benefits of customization in embedded system design while not exceeding design budgets. Such design tools must understand and exploit the hierarchical structure of design spaces, because systems of any significant complexity typically consist of components (subsystems). In order to reduce the design cost for such(More)
This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store(More)
Due to increasing cache-miss latencies, cache control instructions are being implemented for future systems. We study the memory referencing behavior of individual machine-level instructions using simulations of fully-associative caches under MIN replacement. Our objective is to obtain a deeper understanding of useful program behavior that can be eventually(More)
To enable concurrent instruction execution, scientific computers generally rely on pipelining, which combines with faster system clocks to achieve greater throughput. Each concurrently executing instruction requires buffer space, usually implemented as a register, to receive its result. This paper focuses on the issue of how many registers are required to(More)
With processor speeds continuing to outpace the memory subsystem, cache missing memory operations continue to become increasingly important to application performance. In response to this continuing trend, most modern processors now support hardware (HW) prefetchers, which act to reduce the missing loads observed by an application.This paper analyzes the(More)