Santosh G. Abraham

The performance of memory-bound commercial applications such as databases is limited by increasing memory latencies. In this paper, we show that exploiting memory-level parallelism (MLP) is an effective approach for improving the performance of these applications and that microarchitecture has a profound impact on achievable MLP. Using the epoch model of MLP, we(More)
Chip multi-threaded (CMT) processors provide support for many simultaneous hardware threads of execution in various ways, including simultaneous multithreading (SMT) and chip multiprocessing (CMP). CMT processors are especially suited to server workloads, which generally have high levels of thread-level parallelism (TLP). In this paper, we describe the(More)
Data prefetching via helper threading has been extensively investigated on Simultaneous Multi-Threading (SMT) or Virtual Multi-Threading (VMT) architectures. Although reportedly large cache latency can be hidden by helper threads at runtime, most techniques rely on hardware support to reduce context switch overhead between the main thread and helper thread(More)
Cache miss characterization models such as the three Cs model are useful in developing schemes to reduce cache misses and their penalty. In this paper we propose the OPT model that uses cache simulation under optimal (OPT) replacement to obtain a finer and more accurate characterization of misses than the three Cs model. However, current methods for optimal(More)
In this paper, we study the instruction cache miss behavior of four modern commercial applications (a database workload, TPC-W, SPECjAppServer2002 and SPECweb99). These applications exhibit high instruction cache miss rates for both the L1 and L2 caches, and a sizable performance improvement can be achieved by eliminating these misses. We show that it is(More)
Automation is the key to the design of future embedded systems as it permits application-specific customization while keeping design costs low. A key problem faced by automatic design systems is evaluating the performance of the vast number of alternative designs in a timely manner. For this paper, we focus on an embedded system consisting of the following(More)
This paper uses bottom-up, static program partitioning to minimize the execution time of parallel programs by reducing interprocessor communication. Program partitioning is applied to a parallel programming construct known as a sequentially iterated parallel loop. This paper develops and evaluates compiler techniques to automatically generate data(More)
Set-associative caches are widely used in CPU memory hierarchies, I/O subsystems, and file systems to reduce average access times. This article proposes an efficient simulation technique for simulating a group of set-associative caches in a single pass through the address trace, where all caches have the same line size but varying associativities and(More)
Automated design tools help to capture the benefits of customization in embedded system design while not exceeding design budgets. Such design tools must understand and exploit the hierarchical structure of design spaces, because systems of any significant complexity typically consist of components (subsystems). In order to reduce the design cost for such(More)