Compiler-based prefetching for recursive data structures
@inproceedings{Luk1996CompilerbasedPF, title={Compiler-based prefetching for recursive data structures}, author={C. K. Luk and Todd C. Mowry}, booktitle={ASPLOS VII}, year={1996} }
Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, its potential in pointer-based applications has remained largely unexplored. This paper investigates compiler-based prefetching for pointer-based applications---in particular, those containing recursive data structures. We identify the fundamental…
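As a rough illustration of software-controlled prefetching on a recursive data structure, the sketch below prefetches both children of a binary-tree node on entry, before the node's own work is done. This is the idea usually described as greedy prefetching. The `Node` type and `tree_sum` function are hypothetical names chosen for this example and are not taken from the paper; `__builtin_prefetch` is the GCC/Clang builtin, assumed to be available, and is only a hint that the hardware may ignore.

```c
#include <stddef.h>

/* Hypothetical binary-tree node; field names are illustrative only. */
typedef struct Node {
    int value;
    struct Node *left;
    struct Node *right;
} Node;

/* Sum every value in the tree, issuing prefetches for both children
 * as soon as a node is entered. */
long tree_sum(const Node *n)
{
    if (n == NULL)
        return 0;

    /* Greedy step: request both child nodes now so that their cache
     * misses can overlap with the work done on this node and on the
     * left subtree. Prefetching a NULL pointer is harmless because
     * the builtin does not fault on invalid addresses. */
    __builtin_prefetch(n->left);
    __builtin_prefetch(n->right);

    return n->value + tree_sum(n->left) + tree_sum(n->right);
}
```

Whether such prefetches help depends on how much work separates the prefetch from the first use of the child; on small, cache-resident trees they only add instruction overhead.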
422 Citations
Automatic Compiler-Inserted Prefetching for Pointer-Based Applications
- Computer Science
- IEEE Trans. Computers
- 1999
The scope of automatic compiler-inserted prefetching is expanded to also include the recursive data structures commonly found in pointer-based applications, and the most widely applicable scheme (greedy prefetching) is proposed and automated in an optimizing research compiler.
Profile-guided post-link stride prefetching
- Computer Science
- ICS '02
- 2002
This study uses profiling to discover strided accesses that frequently occur during program execution but are not determinable by the compiler, and uses the strides discovered to insert prefetches into the executable directly, without the need for re-compilation.
Library-based Prefetching for Pointer-intensive Applications
- Computer Science
- 2006
A novel software prefetching scheme for pointer-based data structures in which prefetching is performed by a helper thread included in the data-structure library code, so the user application is not modified at all; the benefits are robust across a range of memory system and application parameters without the need for recompilation.
A stateless, content-directed data prefetching mechanism
- Computer Science
- ASPLOS X
- 2002
Content-Directed Data Prefetching is proposed, a data prefetching architecture that exploits the memory allocation used by operating systems and runtime systems to improve the performance of pointer-intensive applications constructed using modern language systems.
Bandwidth-Based Prefetching for Constant-Stride Arrays
- Computer Science
- 2004
A new algorithm for prefetching arrays that are accessed with a compile-time known constant stride is described. It generates prefetches that are more efficient than those of the standard algorithm because it avoids cache conflicts and issues prefetches based on the machine's ability to process memory transactions in parallel.
Runahead execution vs. conventional data prefetching in the IBM POWER6 microprocessor
- Computer Science
- 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS)
- 2010
It is found that the POWER6 implementation of runahead prefetching is quite effective on many of the memory-intensive applications studied; in isolation it improves performance by as much as 36% and by 10% on average.
Software caching vs. prefetching
- Computer Science
- ISMM '02
- 2002
This paper investigates the technique of software caching for applications that perform searches or sorted insertions and finds that, for applications involving a search, software caching performs up to 30% better than the original application.
Tolerating latency in multiprocessors through compiler-inserted prefetching
- Computer Science
- TOCS
- 1998
The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses, and can improve the speed of some parallel applications by as much as a factor of two.
Data Prefetching for Non-Linear Memory References
- Computer Science
- HPCN Europe
- 1998
A new data prefetch scheme, called Reference Value Prediction Caching (RVPC), is proposed in this paper, and it is shown that a significant reduction in memory latency can be expected from the RVPC scheme, especially for applications with pointers.
A Practical Stride Prefetching Implementation in Global Optimizer
- Computer Science
- 2008
A new inductive data prefetching algorithm is implemented in the global optimizer. It is based on demand-driven speculative recognition of inductive expressions, which is equivalent to strongly connected component detection in the data flow graph, thus eliminating the need to invoke the loop nest optimizer.
References
SHOWING 1-10 OF 31 REFERENCES
Design and evaluation of a compiler algorithm for prefetching
- Computer Science
- ASPLOS V
- 1992
This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs; some of the programs improve by as much as a factor of two.
Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching
- Computer Science
- Proceedings 22nd Annual International Symposium on Computer Architecture
- 1995
This paper presents a new prefetching scheme that, while usable by regular applications, is specifically targeted to irregular ones: memory binding and group prefetching, which hardware-bind and prefetch together groups of data that the programmer suggests are strongly related to each other.
Tolerating latency through software-controlled data prefetching
- Computer Science
- 1994
This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code that attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses, and investigates the architectural support necessary to make prefetching effective.
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching
- Computer Science
- MICRO 24
- 1991
This paper examines alternative data access microarchitectures that effectively support compiler-assisted data prefetching in superscalar processors and shows that a small data cache with compiler-assisted data prefetching can achieve a performance level close to that of an ideal cache.
Compiler optimizations for improving data locality
- Computer Science
- ASPLOS VI
- 1994
This paper presents compiler optimizations to improve data locality based on a simple yet accurate cost model and demonstrates that these program transformations are useful for optimizing many programs.
Cache miss heuristics and preloading techniques for general-purpose programs
- Computer Science
- MICRO 1995
- 1995
This paper presents a latency-hiding compiler technique, applicable to general-purpose C programs, that "preloads" the data that are likely to cause a cache miss before they are used, thereby hiding the cache miss latency.
Software prefetching
- Computer Science
- ASPLOS IV
- 1991
These simulations show that, even when generated by a very simple compiler algorithm, prefetch instructions can eliminate nearly all cache misses, while causing only modest increases in data traffic between memory and cache.
A general data dependence test for dynamic, pointer-based data structures
- Computer Science
- PLDI '94
- 1994
This paper presents a new technique for performing more accurate data dependence testing in the presence of dynamic, pointer-based data structures, and demonstrates its effectiveness by breaking false dependences that existing approaches cannot, and provides results which show that removing these dependences enables significant parallelization of a real application.
APRIL: a processor architecture for multiprocessing
- Computer Science
- [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture
- 1990
The authors show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.
Interleaving: a multithreading technique targeting multiprocessors and workstations
- Computer Science
- ASPLOS VI
- 1994
It is shown that while current multiple-context designs work reasonably well for multiprocessors, they are ineffective in hiding the much shorter uniprocessor latencies using the limited parallelism found in workstation environments, and an alternative design is proposed that combines the best features of two existing approaches.