Scalable hardware memory disambiguation for high ILP processors

  title={Scalable hardware memory disambiguation for high ILP processors},
  author={Simha Sethumadhavan and Rajagopalan Desikan and Doug Burger and Charles R. Moore and Stephen W. Keckler},
  journal={Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.},
This paper describes several methods for improving the scalability of memory disambiguation hardware for future high-ILP processors. As the number of in-flight instructions grows with issue width and pipeline depth, load/store queues (LSQs) threaten to become a bottleneck in both power and latency. By employing lightweight approximate hashing in hardware with structures called Bloom filters, many improvements to the LSQ are possible. We propose two types of filtering schemes using Bloom… 
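The abstract is cut off before it describes the two filtering schemes, so the following is only a minimal software sketch of the general idea it names: a Bloom filter placed in front of a store queue so that most loads can skip the associative search. It is not the paper's design. The class names, the single modulo hash, the bucket count, and the 8-byte granularity are all assumptions made for illustration.

```python
class CountingBloomFilter:
    """Single-hash counting Bloom filter over addresses (illustrative only)."""

    def __init__(self, num_buckets=64, granule_bytes=8):
        self.num_buckets = num_buckets      # assumed size, not from the paper
        self.granule_bytes = granule_bytes  # assumed hashing granularity
        self.counts = [0] * num_buckets

    def _index(self, addr):
        # Hypothetical hash: drop the byte offset, keep the low-order bits.
        return (addr // self.granule_bytes) % self.num_buckets

    def insert(self, addr):
        self.counts[self._index(addr)] += 1

    def remove(self, addr):
        idx = self._index(addr)
        assert self.counts[idx] > 0, "remove without matching insert"
        self.counts[idx] -= 1

    def may_contain(self, addr):
        # False means "definitely absent"; True means "possibly present".
        return self.counts[self._index(addr)] > 0


class FilteredStoreQueue:
    """Store queue whose associative search is guarded by the filter."""

    def __init__(self):
        self.filter = CountingBloomFilter()
        self.stores = []   # in-flight (addr, value) pairs, oldest first
        self.searches = 0  # number of full queue searches performed

    def issue_store(self, addr, value):
        self.stores.append((addr, value))
        self.filter.insert(addr)

    def commit_oldest_store(self):
        addr, value = self.stores.pop(0)
        self.filter.remove(addr)
        return addr, value

    def load(self, addr):
        """Return the forwarded value, or None if no in-flight store matches."""
        if not self.filter.may_contain(addr):
            return None  # filter miss: the search is skipped entirely
        self.searches += 1
        # Youngest matching store wins, so scan from the newest entry.
        for st_addr, value in reversed(self.stores):
            if st_addr == addr:
                return value
        return None  # false positive: bucket was occupied by another address


sq = FilteredStoreQueue()
sq.issue_store(0x1000, 42)
hit = sq.load(0x1000)        # true match: searched, value forwarded
skipped = sq.load(0x2008)    # different bucket: search skipped
aliased = sq.load(0x1200)    # same bucket as 0x1000: searched, no match
searches_before_commit = sq.searches
sq.commit_oldest_store()
after_commit = sq.load(0x1000)  # filter is empty again: search skipped
```

The filter never reports a false negative, so skipping the search on a miss is safe; a false positive only costs an unnecessary search. Counters rather than single bits are used here so that entries can be removed when a store leaves the queue.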
4 Citations
Runtime dependency analysis for loop pipelining in High-Level Synthesis
This work proposes to address this issue by leveraging well-known techniques used in superscalar processors to perform runtime memory disambiguation, and demonstrates significant performance improvements for a moderate increase in area while retaining portability among HLS tools.
Reducing Cache Hierarchy Energy Consumption by Predicting Forwarding and Disabling Associative Sets
A straightforward filtering technique based on a highly accurate forwarding predictor predicts whether a load instruction will obtain its data via forwarding from the load-store structure, thus avoiding the data cache access, or from the data cache.
A performance-correctness explicitly-decoupled architecture
This paper proposes to separate performance goals from the correctness goal using an explicitly-decoupled architecture and shows that such a decoupled design allows significant optimization benefits and is much less sensitive to conservatism applied in the correctness domain.
Partition the Banks, not the Functionality, of Large-Window Load/Store Queues
A family of distributed load/store queue designs is described that avoid the need for partitioning the LSQ functionality but achieve comparable energy efficiency, with performance comparable to an ideal LSQ.