Compiler-based prefetching for recursive data structures

  • C.-K. Luk, Todd C. Mowry
  • Published in ASPLOS VII, 1 October 1996
Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, its potential in pointer-based applications has remained largely unexplored. This paper investigates compiler-based prefetching for pointer-based applications---in particular, those containing recursive data structures. We identify the fundamental… 

Automatic Compiler-Inserted Prefetching for Pointer-Based Applications

This work expands the scope of automatic compiler-inserted prefetching to include the recursive data structures commonly found in pointer-based applications, and proposes and automates the most widely applicable scheme, greedy prefetching, in an optimizing research compiler.

Profile-guided post-link stride prefetching

This study uses profiling to discover strided accesses that frequently occur during program execution but are not determinable by the compiler, and uses the strides discovered to insert prefetches into the executable directly, without the need for re-compilation.

Library-based Prefetching for Pointer-intensive Applications

A novel software prefetching scheme for pointer-based data structures in which prefetching is performed by a helper thread included in the data-structure library code; the user application is not modified at all, and the benefits are robust across a range of memory-system and application parameters without the need for recompilation.

A stateless, content-directed data prefetching mechanism

Content-Directed Data Prefetching is proposed, a data prefetching architecture that exploits the memory allocation used by operating systems and runtime systems to improve the performance of pointer-intensive applications constructed using modern language systems.

Bandwidth-Based Prefetching for Constant-Stride Arrays

A new algorithm for prefetching arrays that are accessed with a compile-time known constant stride is described, which generates prefetches that are more efficient than the standard algorithm because it avoids cache conflicts and issues prefetches based on the machine’s ability to process memory transactions in parallel.

Runahead execution vs. conventional data prefetching in the IBM POWER6 microprocessor

  • Harold W. Cain, P. Nagpurkar
  • 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS)
  • 2010
It is found that the POWER6 implementation of runahead prefetching is quite effective on many of the memory intensive applications studied; in isolation it improves performance as much as 36% and on average 10%.

Software caching vs. prefetching

This paper investigates the technique of software caching for applications that perform searches or sorted insertions and finds that for applications involving a search, software caching performs as high as 30% better than the original application.

Tolerating latency in multiprocessors through compiler-inserted prefetching

The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses, and can improve the speed of some parallel applications by as much as a factor of two.

Data Prefetching for Non-Linear Memory References

A new data prefetch scheme, called Reference Value Prediction Caching (RVPC), is proposed in this paper, and it is shown that a significant reduction in memory latency can be expected from the RVPC scheme, especially for applications with pointers.

A Practical Stride Prefetching Implementation in Global Optimizer

A new inductive data prefetching algorithm implemented in the global optimizer, based on demand-driven speculative recognition of inductive expressions, which reduces to strongly-connected-component detection in the data-flow graph, thus eliminating the need to invoke the loop nest optimizer.

Design and evaluation of a compiler algorithm for prefetching

This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs-some of the programs improve by as much as a factor of two.

Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

  • Zheng Zhang, J. Torrellas
  • Proceedings 22nd Annual International Symposium on Computer Architecture
  • 1995
This paper presents a new prefetching scheme that, while usable by regular applications, is specifically targeted at irregular ones: memory binding and group prefetching, which hardware-bind and prefetch together groups of data that the programmer suggests are strongly related to each other.

Tolerating latency through software-controlled data prefetching

This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code that attempts to minimize overheads by issuing prefetches only for references that are predicted to suffer cache misses, and investigates the architectural support necessary to make prefetching effective.

Data access microarchitectures for superscalar processors with compiler-assisted data prefetching

This paper examines alternative data access microarchitectures that effectively support compiler-assisted data prefetching in superscalar processors and shows that a small data cache with compiler-assisted data prefetching can achieve a performance level close to that of an ideal cache.

Compiler optimizations for improving data locality

This paper presents compiler optimizations to improve data locality based on a simple yet accurate cost model and demonstrates that these program transformations are useful for optimizing many programs.

Cache miss heuristics and preloading techniques for general-purpose programs

This paper presents a latency-hiding compiler technique, applicable to general-purpose C programs, that 'preloads' data likely to cause a cache miss before they are used, thereby hiding the cache-miss latency.

Software prefetching

These simulations show that, even when generated by a very simple compiler algorithm, prefetch instructions can eliminate nearly all cache misses, while causing only modest increases in data traffic between memory and cache.

A general data dependence test for dynamic, pointer-based data structures

This paper presents a new technique for performing more accurate data dependence testing in the presence of dynamic, pointer-based data structures, and demonstrates its effectiveness by breaking false dependences that existing approaches cannot, and provides results which show that removing these dependences enables significant parallelization of a real application.

APRIL: a processor architecture for multiprocessing

The authors show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.

Interleaving: a multithreading technique targeting multiprocessors and workstations

It is shown that while current multiple-context designs work reasonably well for multiprocessors, they are ineffective in hiding the much shorter uniprocessor latencies using the limited parallelism found in workstation environments, and an alternative design is proposed that combines the best features of two existing approaches.