Effective Hardware Based Data Prefetching for High-Performance Processors

@article{Chen1995EffectiveHB,
  title={Effective Hardware Based Data Prefetching for High-Performance Processors},
  author={Tien-Fu Chen and Jean-Loup Baer},
  journal={IEEE Trans. Computers},
  year={1995},
  volume={44},
  pages={609--623}
}
Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. The three designs differ… 
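The RPT scheme described above can be illustrated with a minimal sketch: a table indexed by the load instruction's PC, where each entry remembers the last address and stride and promotes itself to a steady state when the stride repeats. This is a simplified two-state version of the idea (the paper's designs use a richer state machine); the class and field names here are illustrative assumptions, not the paper's notation.

```python
# Minimal sketch of a reference prediction table (RPT) stride prefetcher,
# loosely following the idea in the abstract. Simplified assumption: two
# states ("transient"/"steady") instead of the full state machine.

class RPTEntry:
    def __init__(self, addr):
        self.prev_addr = addr
        self.stride = 0
        self.state = "transient"

class RPTPrefetcher:
    def __init__(self):
        # Indexed by instruction address (PC), like an instruction cache.
        self.table = {}

    def access(self, pc, addr):
        """Record a load at `pc` touching `addr`; return a predicted
        prefetch address, or None if no confident prediction exists."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = RPTEntry(addr)
            return None
        new_stride = addr - entry.prev_addr
        if new_stride == entry.stride and entry.stride != 0:
            entry.state = "steady"          # stride confirmed twice in a row
        else:
            entry.stride = new_stride       # learn the new stride, stay tentative
            entry.state = "transient"
        entry.prev_addr = addr
        if entry.state == "steady":
            return addr + entry.stride      # predicted next access
        return None
```

For a load walking an array of 8-byte elements (addresses 100, 108, 116, ...), the first two accesses train the entry and the third onward yields a prefetch one stride ahead.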
When Caches are not Enough : A Review of Data Prefetching Techniques
TLDR
To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead; when these requirements are met, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses.
Prefetching for the Kilo-Instruction Processor
TLDR
The results show that the prefetching scheme effectively eliminates a major portion of the data access penalty in a uniprocessor environment but provides less than 15% speedup improvement when applied to the Kilo-Instruction processor.
An automated method for software controlled cache prefetching
  • D. Zucker, R. Lee, M. Flynn
  • Computer Science
    Proceedings of the Thirty-First Hawaii International Conference on System Sciences
  • 1998
TLDR
The software prefetching technique presented is motivated by emulation of a hardware stride prediction table (SPT); performance similar to, and in some cases superior to, the hardware-based technique is achieved with no additional hardware cost.
The Impact of Timeliness for Hardware-based Prefetching from Main Memory
TLDR
The importance of timeliness is shown by simulating prefetch oracles with perfect coverage and accuracy and it is shown that in order to approach completely hiding the memory latency even under perfect conditions, prefetches must be initiated more than one L2 cache miss ahead.
Storage-Efficient Data Prefetching for High Performance Computing
TLDR
This work proposes a novel Dynamic Signature Method (DSM) that stores addresses efficiently to reduce the storage demand of prefetching, and shows that the new DSM-based prefetcher achieved better performance improvement on over half of the benchmarks compared to existing prefetching approaches with the same storage consumption.
A Survey of Data Prefetching Techniques
TLDR
Several alternative approaches are examined and the design tradeoffs involved when implementing a data prefetch strategy are discussed, showing the potential to significantly improve overall program execution time by overlapping computation with memory accesses.
Data prefetch mechanisms
TLDR
To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead, and secondary effects such as cache pollution and increased memory bandwidth requirements must be taken into consideration.
CPU Cache Prefetching: Timing Evaluation of Hardware Implementations
TLDR
This paper presents extensive quantitative results of a detailed cycle-by-cycle trace-driven simulation of a uniprocessor memory system in which most of the relevant parameters are varied in order to determine when and if hardware prefetching is useful.
Tango: a hardware-based data prefetching technique for superscalar processors
  • S. Pinter, A. Yoaz
  • Computer Science
    Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29
  • 1996
TLDR
A new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors is presented and a new hardware construct, the program progress graph (PPG), is suggested as a simple extension to the branch target buffer (BTB).

References

Showing 1-10 of 27 references
Data prefetching for high-performance processors
TLDR
This dissertation proposes and evaluates data prefetching techniques that address the data access penalty problem; an approach that combines software and hardware schemes is shown to be very promising for reducing memory latency with the least overhead.
Design and evaluation of a compiler algorithm for prefetching
TLDR
This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs-some of the programs improve by as much as a factor of two.
An architecture for software-controlled data prefetching
  • A. Klaiber, H. Levy
  • Computer Science
    [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture
  • 1991
TLDR
Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops at only slight cost in total memory traffic.
Compiler-directed data prefetching in multiprocessors with memory hierarchies
TLDR
An algorithm for finding the earliest point in a program at which a block of data can be prefetched, based on the control and data dependencies in the program, is presented; it forms an integral part of more general memory management algorithms.
Reducing memory latency via non-blocking and prefetching caches
TLDR
A hybrid design based on the combination of non-blocking and prefetching caches is proposed, which is found to be very effective in reducing the memory latency penalty for many applications.
Data prefetching in multiprocessor vector cache memories
  • John W. C. Fu, J. Patel
  • Computer Science
    [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture
  • 1991
TLDR
This paper reports the cache performance of a set of vectorized numerical programs from the Perfect Club benchmarks and describes two simple prefetch schemes to reduce the influence of long-stride vector accesses and misses due to block invalidations in multiprocessor vector caches.
Lockup-free instruction fetch/prefetch cache organization
TLDR
A cache organization is presented that essentially eliminates a penalty on subsequent cache references following a cache miss and has been incorporated in a cache/memory interface subsystem design, and the design has been implemented and prototyped.
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching
TLDR
This paper examines alternative data access microarchitectures that effectively support compiler-assisted data prefetching in superscalar processors and shows that a small data cache with compiler-assisted data prefetching can achieve a performance level close to that of an ideal cache.
Data Prefetching in Shared Memory Multiprocessors
TLDR
Using the multiprocessor cache model for comparison, data prefetching is found to be more effective than caches in addressing the memory access bottleneck.