Rajagopalan Desikan

Learn More
Growing on-chip wire delays will cause many future microarchitectures to be distributed, in which hardware resources within a single processor become nodes on one or more switched micronetworks. Since large processor cores will require multiple clock cycles to traverse, control must be distributed, not centralized. This paper describes the control protocols(More)
This paper describes several methods for improving thescalability of memory disambiguation hardware for futurehigh ILP processors. As the number of in-flight instructionsgrows with issue width and pipeline depth, the load/storequeues (LSQ) threaten to become a bottleneck in both powerand latency. By employing lightweight approximate hashingin hardware with(More)
Impediments to main memory performance have traditionally been due to the divergence in processor versus memory speed and the pin bandwidth limitations of modern packaging technologies. In this paper we evaluate a magneto-resistive memory (MRAM)-based hierarchy to address these future constraints. MRAM devices are non-volatile, and have the potential to be(More)
Note: The compact format for this draft version of the dissertation is motivated by the need to save paper and have a document that can be stapled and carried easily in a stack of papers. A version using the official UT dissertation format is available to committee members upon request. Acknowledgments In my research, I have received assistance from many(More)
Pipeline flushes are becoming increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculations by re-executing only instructions affected by the mis-speculation, instead of all instructions. In this paper we introduce a new selective(More)
This short paper serves to correct the errors contained in the paper entitled "Measuring Experimental Error in Microprocessor Simulation," presented at the 2001 International Symposium on Computer Architecture (ISCA-28) [2]. That paper contained a study of a validated microarchitectural simulator called sim-alpha, and included a case study that compared(More)
Transmission of cache lines in cache-coherent shared memory machines is necessary for communication but can cause significant latencies across the system. The ongoing growth in cache capacities shifts the distribution of cache misses from capacity and conflict misses to coherence misses, which consist of misses caused by both true and false sharing. In this(More)
In this paper, we describe a lightweight protocol to support selective re-execution on the TRIPS processor. The protocol permits multiple waves of speculation to be traversing a dataflow graph simultaneously and in any order, with a cleanup " commit " wave propagating as well to determine completion of a group of instructions. The protocol is completely(More)
False sharing of data is an important phenomenon affecting performance in shared memory multiprocessors. False sharing results in unnecessary coherency overhead by causing invalidation of the shared cache line, and increasing the latency of the load accessing the cache line. As microprocessors incorporate increasingly large caches with large cache lines,(More)
— In this paper, we describe the design and implementation of the primary memory system of the TRIPS processor. To match the aggressive execution bandwidth and support high levels of memory parallelism, the primary memory system is completely partitioned into four banks, can support up to 256 in-flight memory instructions, aggressive reordering of in-flight(More)