A data locality optimizing algorithm

@inproceedings{Wolf1991ADL,
  title={A data locality optimizing algorithm},
  author={Michael E. Wolf and Monica S. Lam},
  booktitle={PLDI '91},
  year={1991}
}
This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing, and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifies the various transforms as unimodular matrix transformations. The algorithm has been implemented in the SUIF (Stanford University Intermediate Format) compiler, and is successful in optimizing codes…
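To make the unimodular framework concrete, here is a minimal sketch in my own notation (the paper's formalism is more general): iterations of a depth-2 nest are integer vectors, and each elementary transform is a unimodular matrix acting on them.

```latex
% Sketch (my notation): iterations of a depth-2 loop nest are vectors
% \vec{i} = (i_1, i_2)^T \in \mathbb{Z}^2, and a transform is a unimodular
% matrix T (integer entries, |\det T| = 1) mapping \vec{i} to T\vec{i}.
\[
T_{\text{interchange}} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\qquad
T_{\text{reversal}} = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix},
\qquad
T_{\text{skew}} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}.
\]
% A compound transformation is a product of such matrices; it is legal when
% every dependence vector \vec{d} of the nest satisfies
% T\vec{d} \succ \vec{0} (lexicographically positive).
```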
Tiling Optimizations for 3D Scientific Computations
TLDR
Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
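A minimal sketch of the general idea, assuming a Jacobi-style 3D relaxation kernel (the paper's transformations and tile-size selection are more sophisticated; N and TILE here are arbitrary placeholders):

```c
/* Illustrative 3D tiling sketch (not the paper's exact transformation):
 * the j and k loops are blocked so each tile's working set stays in cache;
 * N and TILE are placeholder values. */
#include <stddef.h>

#define N    128
#define TILE 16

static double a[N][N][N], b[N][N][N];

void sweep_tiled(void)
{
    for (size_t jj = 1; jj < N - 1; jj += TILE)
        for (size_t kk = 1; kk < N - 1; kk += TILE)
            for (size_t i = 1; i < N - 1; i++)      /* outer dim left untiled */
                for (size_t j = jj; j < jj + TILE && j < N - 1; j++)
                    for (size_t k = kk; k < kk + TILE && k < N - 1; k++)
                        b[i][j][k] = (a[i-1][j][k] + a[i+1][j][k] +
                                      a[i][j-1][k] + a[i][j+1][k] +
                                      a[i][j][k-1] + a[i][j][k+1]) / 6.0;
}
```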
MIST: an algorithm for memory miss traffic management
  • P. Grun, N. Dutt, A. Nicolau
  • Computer Science
    Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD)
  • 2000
TLDR
This paper presents a memory-aware compiler technique that actively manages cache misses, and performs global miss traffic optimizations, to better hide the latency of the memory operations.
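MIST's own miss-traffic scheduling is not reproduced here; as a generic illustration of the underlying goal of overlapping miss latency with useful work, a compiler (or programmer) can issue prefetches a fixed distance ahead. This sketch uses the GCC/Clang __builtin_prefetch intrinsic; PREFETCH_DIST is an arbitrary placeholder.

```c
/* Generic latency-hiding sketch (standard software prefetching, not the
 * MIST algorithm): fetch data PREFETCH_DIST iterations ahead so the miss
 * latency overlaps with the additions. */
#include <stddef.h>

#define PREFETCH_DIST 16

double sum(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&x[i + PREFETCH_DIST], 0 /* read */, 1);
        s += x[i];
    }
    return s;
}
```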
Data relation vectors: a new abstraction for data optimizations
  • M. Kandemir, J. Ramanujam
  • Computer Science
    Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques (PACT)
  • 2000
TLDR
The data relation vector abstraction has been implemented in the SUIF compilation framework and tested on twelve benchmarks from the image processing and scientific computation domains; preliminary results on a superscalar processor show that it is successful in reducing compilation time and outperforms two previously proposed techniques.
Quantifying loop nest locality using SPEC'95 and the perfect benchmarks
TLDR
It is found that temporal and spatial reuse have balanced roles within a loop nest and that most reuse across nests and the entire program is temporal, which goes against the commonly held assumption that spatial reuse dominates.
A matrix-based approach to the global locality optimization problem
TLDR
This paper argues for a combined approach which employs both loop and data transformations and shows that this process can be put in a simple matrix framework which can be manipulated by an optimizing compiler.
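A sketch of how such a matrix framework is typically set up (my notation; the paper's formulation may differ in its details):

```latex
% Sketch (my notation): an array reference in a depth-n nest is an affine
% map \vec{a} = A\vec{i} + \vec{o} of the iteration vector \vec{i}.
% A loop transformation T remaps iterations (\vec{i}' = T\vec{i}) and a
% data transformation M remaps the array layout, giving the transformed
% reference
\[
\vec{a}' = M\left(A\,T^{-1}\,\vec{i}' + \vec{o}\right),
\]
% so the compiler can search jointly for T and M that make the fastest-
% varying loop index produce stride-1 accesses.
```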
Static locality analysis for cache management
TLDR
It is shown that previous proposals for locality analysis are not appropriate in the presence of a high conflict miss ratio, and a compile-time interference analysis is introduced that significantly improves their performance.
A quantitative analysis of loop nest locality
TLDR
The Perfect Benchmarks are used to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties and show that several popular assertions are at best overstatements.
The cache performance and optimizations of blocked algorithms
TLDR
It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
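For context, a minimal blocked matrix-multiply sketch exposes the block-size parameter whose interaction with access stride the paper analyzes (N and BS are illustrative placeholders, with N assumed divisible by BS):

```c
/* Blocked (tiled) matrix multiplication sketch: BS is the block size whose
 * interaction with array stride determines cache interference. */
#include <stddef.h>

#define N  512
#define BS 64

void matmul_blocked(const double A[N][N], const double B[N][N],
                    double C[N][N])
{
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++)
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```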
Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication
TLDR
COSMA is a parallel matrix-matrix multiplication algorithm that is near communication-optimal for all combinations of matrix dimensions, processor counts, and memory sizes, and outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all scenarios.
A compiler-based approach for dynamically managing scratch-pad memories in embedded systems
TLDR
A compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that includes an optimization suite using loop and data transformations, an on-chip memory partitioning step, and a code-rewriting phase that collectively transform an input code automatically to take advantage of the on-chip SPM.
...

References

Showing 1-10 of 41 references
The cache performance and optimizations of blocked algorithms
TLDR
It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
Improving register allocation for subscripted variables
TLDR
This paper presents a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers.
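A before/after sketch of scalar replacement on a simple recurrence (function and variable names are mine, not the paper's):

```c
/* Before: a[i-1] re-reads the value stored on the previous iteration. */
void prefix_sum_orig(double *a, const double *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* After scalar replacement: the reused subscripted value is carried in
 * the scalar t, which a coloring-based allocator can keep in a register. */
void prefix_sum_scalar_replaced(double *a, const double *b, int n)
{
    if (n < 1) return;
    double t = a[0];                  /* t carries a[i-1] across iterations */
    for (int i = 1; i < n; i++) {
        t += b[i];
        a[i] = t;
    }
}
```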
Matrix computations
Dependence analysis for supercomputing
  • U. Banerjee
  • Mathematics
    The Kluwer international series in engineering and computer science
  • 1988
TLDR
The book develops dependence theory for supercomputing, covering one-dimensional arrays, single loops, dependence tests, dependence vectors, and their application to systems of equations.
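As a flavor of the dependence tests the book develops, the classical GCD test can be stated in one line (standard material, sketched in my notation):

```latex
% GCD test: within one loop, references a[c_1 i + c_2] and a[c_3 i' + c_4]
% can touch the same element only if c_1 i - c_3 i' = c_4 - c_2 has an
% integer solution, which requires
\[
\gcd(c_1, c_3) \;\big|\; (c_4 - c_2).
\]
% If the divisibility fails, the compiler has proved independence.
```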
Impact of Hierarchical Memory Systems On Linear Algebra Algorithm Design
TLDR
A methodology is proposed that facilitates analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters, in order to identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration.
Improving the performance of virtual memory computers
Software methods for improvement of cache performance on supercomputer applications
TLDR
Measurements of actual supercomputer cache performance have not previously been undertaken; PFC-Sim, a program-driven event-tracing facility that can simulate the data cache performance of very long programs, is used to measure the performance of various cache structures.
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
TLDR
The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest, and it is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests.
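A tiny illustration of the wavefronting idea described here (schematic, not the paper's algorithm): after skewing, all iterations on one anti-diagonal of a 2D recurrence are mutually independent, so they can run in parallel. N is a placeholder.

```c
/* Wavefront sketch: points with equal w = i + j depend only on wavefront
 * w - 1, so each inner loop can execute in parallel (the OpenMP pragma is
 * illustrative and ignored without -fopenmp). */
#define N 1024
static double x[N][N];

void wavefront(void)
{
    for (int w = 2; w <= 2 * (N - 1); w++) {
        int lo = (w - (N - 1) > 1) ? w - (N - 1) : 1;
        int hi = (w - 1 < N - 1) ? w - 1 : N - 1;
        #pragma omp parallel for
        for (int i = lo; i <= hi; i++) {
            int j = w - i;
            x[i][j] = x[i - 1][j] + x[i][j - 1];  /* flow dependences (1,0), (0,1) */
        }
    }
}
```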
...