A Comparison of Compiler Tiling Algorithms

Gabriel Rivera and Chau-Wen Tseng
Linear algebra codes contain data locality which can be exploited by tiling multiple loop nests. Several approaches to tiling have been suggested for avoiding conflict misses in low associativity caches. We propose a new technique based on intra-variable padding and compare its performance with existing techniques. Results show padding improves performance of matrix multiply by over 100% in some cases over a range of matrix sizes. Comparing the efficacy of different tiling algorithms, we… 
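As a rough illustration of the two ideas named in the abstract, the sketch below tiles a matrix multiply and stores each matrix with a padded row stride. It is not the paper's algorithm: the function names, the `tile` and `pad` parameters, and the loop order are all assumptions chosen for illustration, and Python will not show the cache effects themselves. It only shows the index arithmetic that tiling and intra-variable padding change.

```python
def to_padded(rows, pad):
    """Flatten a list of rows into one list, appending `pad` unused slots per row."""
    flat = []
    for row in rows:
        flat.extend(row)
        flat.extend([0.0] * pad)
    return flat


def tiled_matmul(a, b, n, tile, pad=0):
    """Multiply two n x n matrices stored row-major in flat lists.

    Each row occupies n + pad slots. The pad slots are never read as data;
    they only change the stride between rows (intra-variable padding).
    """
    ld = n + pad  # padded leading dimension (row stride)
    c = [0.0] * (n * ld)
    # Tile the k and j loops so a tile x tile block of b is reused across all i.
    for kk in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(n):
                for k in range(kk, min(kk + tile, n)):
                    a_ik = a[i * ld + k]
                    for j in range(jj, min(jj + tile, n)):
                        c[i * ld + j] += a_ik * b[k * ld + j]
    return c
```

In a compiled language the pad would be chosen so that rows of the tile map to different cache sets; here the value is arbitrary and the result is identical for any `tile` and `pad`.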

Code tiling for improving the cache performance of PDE solvers

A novel compiler technique called code tiling is presented for generating fast tiled codes for SOR-like PDE solvers on uniprocessors with a memory hierarchy. It combines loop tiling with a new array layout transformation called data tiling in such a way that a significant number of cache misses are eliminated.


This paper develops a new selection algorithm targeting relaxation codes that considers the effect of loop skewing, which is necessary to tile such codes, and achieves an average speedup of 1.27 to 1.63 over all the other algorithms.

Optimal skewed tiling for cache locality enhancement

  • Zhiyuan Li
  • Computer Science
    Proceedings International Parallel and Distributed Processing Symposium
  • 2003
This paper shows that, to optimally tile iterative stencil loops, the imperfectly nested inner loops must be realigned such that they can be minimally skewed across different time steps.

On the Interaction of Tiling and Automatic Parallelization

This paper presents an algorithm that applies tiling in concert with parallelization, and presents the first comprehensive evaluation of tiling techniques on compiler-parallelized programs.

A Stable and Efficient Loop Tiling Algorithm

A new tiling algorithm is developed that performs better than previous algorithms in terms of execution time and stability, and generates code with a performance comparable to the best measured algorithm.

A tile size selection analysis for blocked array layouts

It is proved that, when optimization techniques such as register assignment, array alignment, prefetching and loop unrolling are applied, tile sizes equal to L1 capacity offer better cache utilization, even for loop bodies that access more than just one array.

A Quantitative Analysis of Tile Size Selection Algorithms

A new tiling algorithm is developed that performs better than previous algorithms in terms of execution time and stability, and generates code with a performance comparable to the best measured algorithm.

Exploiting non-uniform reuse for cache optimization

This paper shows that the exploitation of non-uniform reuse can be worthwhile as well, introduces two novel program restructuring techniques called folding and snaking, and studies their performance impact on an exemplary loop nest.

Tile size selection revisited

This article proposes a new analytical model for tile size selection that leverages the high set associativity in modern caches to minimize conflict misses and considers the interaction of tiling with the SIMD unit in modern processors in estimating the optimal tile size.

Tiling Optimizations for 3D Scientific Computations

Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.



A compiler algorithm for optimizing locality in loop nests

An algorithm to optimize cache locality in scientific codes on uniprocessor and multiprocessor machines that considers loop and data layout transformations in a unified framework and can optimize nests for which optimization techniques based on loop transformations alone are not successful.

A data locality optimizing algorithm

An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.

Eliminating conflict misses for high performance architectures

GROUPPAD, an inter-variable padding heuristic that preserves group reuse in the stencil computations frequently found in scientific codes, is presented; padding is also shown to improve performance in parallel programs.

More iteration space tiling

  • M. Wolfe
  • Computer Science
    Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89)
  • 1989
Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has several advantages. Tiles become a natural candidate as the unit of work for parallel task scheduling.
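A minimal sketch of subdividing a two-dimensional iteration space into tiles with a fixed maximum size follows. The helper names `tiles` and `visit_order` are hypothetical and do not come from the cited paper. Each yielded pair of ranges is one tile, which is also the natural unit to hand to a scheduler.

```python
def tiles(n_i, n_j, tile_i, tile_j):
    """Yield (i_range, j_range) pairs that together cover an n_i x n_j iteration space."""
    for ii in range(0, n_i, tile_i):
        for jj in range(0, n_j, tile_j):
            # Edge tiles are clipped, so a tile never exceeds the maximum size.
            yield (range(ii, min(ii + tile_i, n_i)),
                   range(jj, min(jj + tile_j, n_j)))


def visit_order(n_i, n_j, tile_i, tile_j):
    """Return the iteration points in tiled order: tile by tile, row-major inside each tile."""
    return [(i, j)
            for i_range, j_range in tiles(n_i, n_j, tile_i, tile_j)
            for i in i_range
            for j in j_range]
```

Every point of the original space is visited exactly once; only the order changes.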

Tile size selection using cache organization and data layout

This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache that eliminates both capacity and self-interference misses and reduces cross-interference misses.

Data transformations for eliminating conflict misses

Experiments on a range of programs indicate PADLITE can eliminate conflicts for benchmarks, but PAD is more effective over a range of cache and problem sizes, with some SPEC95 programs improving up to 15%.

Improving data locality with loop transformations

This article presents compiler optimizations to improve data locality based on a simple yet accurate cost model and finds that performance improvements were difficult to achieve, although several programs did improve.

Combining Loop Transformations Considering Caches and Scheduling

A model that estimates total machine cycle time, taking into account cache misses, software pipelining, register pressure and loop overhead, is presented, and an algorithm is developed that searches through the possible transformations, using the authors' machine model to select the set of transformations leading to the best overall performance.

Improving register allocation for subscripted variables

This paper presents a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers.
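The shape of the transformation described above can be sketched on a small loop. This is an illustrative example, not taken from the paper: both function names are hypothetical, and Python has no register allocator, so only the source-level rewrite is shown.

```python
def row_sums_naive(a, n, m):
    """Sum each row of an n x m matrix, accessing s[i] on every inner iteration."""
    s = [0.0] * n
    for i in range(n):
        for j in range(m):
            s[i] = s[i] + a[i][j]  # subscripted variable read and written each time
    return s


def row_sums_scalar_replaced(a, n, m):
    """Same computation after replacing the repeated s[i] references with a scalar."""
    s = [0.0] * n
    for i in range(n):
        t = s[i]  # load the subscripted variable once into a temporary scalar
        for j in range(m):
            t = t + a[i][j]  # the inner loop now touches only the scalar
        s[i] = t  # store the result back once
    return s
```

In a compiled language the temporary `t` is a candidate for a register, which is the point of the rewrite.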

Cache miss equations: an analytical representation of cache misses

Methods are described for generating and solving cache miss equations, which give a detailed representation of the cache misses in loop-oriented scientific code and provide a general framework to guide code optimizations for improving cache performance.