Optimized Unrolling of Nested Loops

@article{Sarkar2004OptimizedUO,
  title={Optimized Unrolling of Nested Loops},
  author={Vivek Sarkar},
  journal={International Journal of Parallel Programming},
  year={2004},
  volume={29},
  pages={545--581}
}
  • Vivek Sarkar
  • Published 2004
  • Computer Science
  • International Journal of Parallel Programming
Loop unrolling is a well-known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism and instruction-cache considerations; (ii) a new code…
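A deliberately simplified illustration of choosing unroll factors for a nest jointly follows. It is not the cost model from this paper. It assumes a matrix-multiply nest (C[i][j] += A[i][k] * B[k][j]) whose i and j loops are unrolled by u1 and u2 with k innermost, and estimates register pressure as u1*u2 accumulators plus u1 + u2 operand registers, with u1 + u2 loads and 2*u1*u2 floating-point operations per k iteration. The function name select_unroll_factors, the register budgets, and the max_factor cap (a crude stand-in for instruction-cache limits) are hypothetical.

```python
from itertools import product


def select_unroll_factors(num_registers: int, max_factor: int = 8) -> tuple[int, int]:
    """Pick (u1, u2) maximizing flops per load under a register budget.

    Hypothetical toy model (not the paper's) for C[i][j] += A[i][k] * B[k][j]
    with the i and j loops unrolled by u1 and u2 and k innermost:
      registers ~ u1*u2 accumulators + u1 values of A + u2 values of B
      loads     = u1 + u2 per k iteration
      flops     = 2 * u1 * u2 per k iteration (one multiply, one add each)
    max_factor caps code growth as a crude stand-in for i-cache limits.
    """
    best, best_key = (1, 1), None
    for u1, u2 in product(range(1, max_factor + 1), repeat=2):
        registers = u1 * u2 + u1 + u2
        if registers > num_registers:
            continue  # this pair would spill: reject it
        ratio = (2 * u1 * u2) / (u1 + u2)
        # Prefer more flops per load; on ties prefer the smaller unrolled body.
        key = (ratio, -u1 * u2)
        if best_key is None or key > best_key:
            best, best_key = (u1, u2), key
    return best


print(select_unroll_factors(16))  # (3, 3)
print(select_unroll_factors(32))  # (4, 5)
```

Under this toy model a 16-register budget favors a 3 by 3 unroll, because any larger body would need more registers than are available; a realistic model would also weigh scheduling and cache effects.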
Optimal loop unrolling for GPGPU programs
TLDR
This paper develops a semi-automatic, compile-time approach for identifying optimal unroll factors for suitable loops in GPGPU programs and proposes techniques for reducing the number of unroll factors evaluated, based on the characteristics of the program being compiled and the device being compiled to.
Register tiling in nonrectangular iteration spaces
TLDR
A new general algorithm to perform multidimensional tiling for the register level in both rectangular and nonrectangular iteration spaces is presented and a simple heuristic to determine the tiling parameters is proposed.
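The following hand-written sketch shows what a register tile looks like on a nonrectangular (triangular) iteration space. It is not the general algorithm or the tiling heuristic from the paper: it simply unrolls the row loop of a lower-triangular matrix-vector product by 2, so each x[j] is loaded once for two rows, and handles the ragged triangular edge and an odd row count separately. Function names are hypothetical.

```python
def trmv_reference(L, x):
    """Plain lower-triangular y = L @ x (only entries with j <= i are used)."""
    n = len(x)
    return [sum(L[i][j] * x[j] for j in range(i + 1)) for i in range(n)]


def trmv_tiled(L, x):
    """Same product with the i loop unrolled by 2 (a 2x1 register tile)."""
    n = len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        acc0 = acc1 = 0.0  # two accumulators, intended to stay in registers
        for j in range(i + 1):  # part of the triangle shared by rows i and i+1
            xj = x[j]  # loaded once, reused by both rows
            acc0 += L[i][j] * xj
            acc1 += L[i + 1][j] * xj
        acc1 += L[i + 1][i + 1] * x[i + 1]  # ragged edge: row i+1 has one extra term
        y[i], y[i + 1] = acc0, acc1
        i += 2
    if i < n:  # remainder row when n is odd
        acc = 0.0
        for j in range(i + 1):
            acc += L[i][j] * x[j]
        y[i] = acc
    return y


L = [[1, 0, 0], [2, 3, 0], [4, 5, 6]]
x = [1, 1, 1]
print(trmv_tiled(L, x))  # [1.0, 5.0, 15.0]
```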
Tools for Performance Optimizations and Tuning of Affine Loop Nests
TLDR
This dissertation develops systematic solutions to parameterized multi-level tiling of arbitrary imperfectly nested affine loops for both sequential and parallel execution, and develops approaches to mapping independent tiles to different processing units using compile-time and run-time schedules.
Using Performance Bounds to Guide Pre-scheduling Code Optimizations
We advocate using performance bounds to guide code optimizations. Accurate performance bounds establish an efficient way to evaluate benefits as well as overheads of code transformations without…
Re-selection for Iterative Modulo Scheduling on High Performance Muti-issue DSPs
TLDR
This work proposes a technique that efficiently reselects instructions of an application loop code considering dependence complexity, which directly resolves the dependence constraint, and uses a heuristic to efficiently handle this problem in a pre-stage of iterative modulo scheduling without loop unrolling.
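For background, modulo scheduling starts from a lower bound on the initiation interval (II): MII = max(ResMII, RecMII), where ResMII comes from resource usage and RecMII from dependence cycles. The sketch below computes this standard bound only; it does not implement the instruction re-selection technique from the paper, and the resource classes, counts, and latencies in the example are invented. Roughly, relaxing the dependence constraint is what lowers RecMII.

```python
from math import ceil


def res_mii(op_counts: dict[str, int], units: dict[str, int]) -> int:
    """Resource bound: each resource class must fit its operations into II cycles."""
    return max((ceil(op_counts[r] / units[r]) for r in op_counts), default=1)


def rec_mii(cycles: list[tuple[int, int]]) -> int:
    """Recurrence bound.

    Each dependence cycle is (total latency, total iteration distance);
    a valid schedule needs latency <= II * distance for every cycle.
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)


def mii(op_counts, units, cycles) -> int:
    """Minimum initiation interval: the larger of the two bounds (at least 1)."""
    return max(res_mii(op_counts, units), rec_mii(cycles), 1)


# Invented example: 6 ALU ops on 2 ALUs, 3 memory ops on 1 load/store unit.
ops = {"alu": 6, "mem": 3}
units = {"alu": 2, "mem": 1}
cycles = [(4, 1), (5, 2)]  # (latency, distance) of each dependence cycle
print(res_mii(ops, units), rec_mii(cycles), mii(ops, units, cycles))  # 3 4 4
```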
Using performance bounds to guide code compilation and processor design
TLDR
This thesis introduces a novel bound-guided approach to systematically regulate code-size-related instruction-level parallelism (ILP) optimizations, including tail duplication, loop unrolling, and if-conversion, and develops a heuristic to achieve this optimal tradeoff.
UFS: a global trade‐off strategy for loop unrolling for VLIW architectures
TLDR
This paper proposes a novel method based on Integer Linear Programming for computing efficient unroll factors for collections of loop nests, with control over code size and other side‐effects of the transformation, achieving excellent trade‐offs.
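The paper formulates unroll-factor selection as an Integer Linear Program. As a stand-in, the sketch below solves a small version of the same trade-off with a dynamic program (a multiple-choice knapsack): pick one unroll factor per loop nest to maximize estimated benefit within a total code-size budget. The benefit and size numbers and the function choose_factors are hypothetical, and unlike the paper's model this ignores side-effects other than code size.

```python
def choose_factors(loops, budget):
    """Pick one unroll factor per loop nest under a code-size budget.

    loops: list of dicts mapping unroll factor -> (benefit, extra_code_size).
    Factor 1 (no unrolling) should cost 0 so a feasible choice always exists.
    Returns (total_benefit, [chosen factor per loop]).
    """
    # best[s] = (benefit, choices) for the loops seen so far using exactly s size units
    best = {0: (0, [])}
    for options in loops:
        nxt = {}
        for used, (benefit, choices) in best.items():
            for factor, (gain, size) in options.items():
                total = used + size
                if total > budget:
                    continue  # this choice would exceed the code-size budget
                cand = (benefit + gain, choices + [factor])
                if total not in nxt or cand[0] > nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    return max(best.values(), key=lambda bc: bc[0])


# Hypothetical estimates: factor -> (cycles saved, extra instructions)
loops = [
    {1: (0, 0), 2: (30, 10), 4: (45, 30)},
    {1: (0, 0), 2: (20, 8), 8: (50, 40)},
]
print(choose_factors(loops, 40))  # (65, [4, 2])
print(choose_factors(loops, 20))  # (50, [2, 2])
```

A tighter budget pushes both nests toward smaller factors, which is the kind of global trade-off that choosing factors loop by loop in isolation cannot see.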
Combined ILP and Register Tiling: Analytical Model and Optimization Framework
TLDR
This work develops an analytical model of the combined problem of optimal ILP and register reuse, and formulates a mathematical optimization problem that chooses the parameters of the ILP-exposing transformation and of register tiling so as to minimize the total execution time.
Improving register allocation for subscripted variables
TLDR
This paper presents a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers.
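A minimal before/after sketch of scalar replacement, assuming a simple loop in which each array element is read in two consecutive iterations. Python has no register allocator to benefit, so this only shows the shape of the source-to-source rewrite; the function names are hypothetical.

```python
def smooth_original(a):
    # b[i] = a[i-1] + a[i]: each a[i] is read twice (as a[i], then as a[i-1]).
    b = [0] * len(a)
    for i in range(1, len(a)):
        b[i] = a[i - 1] + a[i]
    return b


def smooth_scalar_replaced(a):
    # After scalar replacement: the value reused across iterations lives in a
    # scalar temporary, so each a[i] is read from the array only once.
    b = [0] * len(a)
    if not a:
        return b
    prev = a[0]
    for i in range(1, len(a)):
        cur = a[i]
        b[i] = prev + cur
        prev = cur  # carry the temporary into the next iteration
    return b


print(smooth_scalar_replaced([1, 2, 3, 4]))  # [0, 3, 5, 7]
```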
On Optimizing the Longest Common Subsequence Problem by Loop Unrolling Along Wavefronts
  • Johann Steinbrecher, W. Shang
  • Computer Science
  • 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing
  • 2012
TLDR
This paper characterizes loop unrolling by two parameters: the unroll factor (the number of iterations in a super iteration) and the unrolling direction (the choice of iterations grouped to form the super iteration).
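The wavefront order named in the title can be sketched for the standard LCS dynamic program: cells with the same i + j depend only on the two previous anti-diagonals, so a whole wavefront can be processed together. This sketch shows only the traversal order, not the paper's choice of unroll factor or unrolling direction; lcs_wavefront is a hypothetical name.

```python
def lcs_wavefront(s: str, t: str) -> int:
    """LCS length, filling the DP table one anti-diagonal (wavefront) at a time.

    Cell (i, j) needs (i-1, j-1), (i-1, j) and (i, j-1), which lie on the two
    previous wavefronts, so the cells of one wavefront are mutually independent.
    """
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for d in range(2, m + n + 1):  # wavefront index d = i + j
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


print(lcs_wavefront("ABCBDAB", "BDCABA"))  # 4
```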

References

SHOWING 1-10 OF 33 REFERENCES
Aggressive Loop Unrolling in a Retargetable Optimizing Compiler
TLDR
This paper describes how aggressive loop unrolling is done in a retargetable optimizing compiler and shows that aggressive loop unrolling can yield an additional performance increase of 10 to 20 percent over the simple, naive approaches employed by many production compilers.
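For readers new to the transformation, this is plain loop unrolling by a fixed factor of 4 with a remainder (epilogue) loop, written by hand; it is an illustration only, not the scheme from the paper. Splitting the sum into four partial accumulators exposes independent additions; for floating-point data that reassociates the sum, which a compiler should do only when relaxed floating-point semantics are permitted. dot_unrolled4 is a hypothetical name.

```python
def dot_unrolled4(x, y):
    """Dot product with the main loop unrolled by 4 plus a remainder loop."""
    n = len(x)
    main = n - n % 4  # iterations covered by full unrolled bodies
    s0 = s1 = s2 = s3 = 0  # separate partial sums expose independent adds
    for i in range(0, main, 4):
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
    total = s0 + s1 + s2 + s3
    for i in range(main, n):  # remainder loop for the last n % 4 iterations
        total += x[i] * y[i]
    return total


print(dot_unrolled4(list(range(10)), list(range(10))))  # 285
```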
Unrolling-based optimizations for modulo scheduling
TLDR
This paper describes the benefits of unrolling and a set of optimizations for unrolled loops that have been implemented in the IMPACT compiler, and reports results for five of the SPEC CFP92 programs.
Unroll-and-jam using uniformly generated sets
  • S. Carr, Yiping Guan
  • Computer Science
  • Proceedings of 30th Annual International Symposium on Microarchitecture
  • 1997
TLDR
This paper presents an algorithm that uses a linear-algebra-based technique to compute unroll amounts; compared with dependence-based techniques, it reduces the total number of dependences needed across the benchmark suite by 84%.
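Unroll-and-jam itself can be sketched on matrix multiplication: unroll the outer i loop by 2 and fuse (jam) the two copies of the inner loops, so each B[k][j] is loaded once and used for two rows. The unroll amount is fixed at 2 here; computing it, which is what the paper's technique addresses, is not shown. The rewrite is legal in this case because no dependence is carried by the i loop. Function names are hypothetical.

```python
def matmul(A, B):
    """Reference triple loop: C = A @ B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_unroll_jam(A, B):
    """Unroll the i loop by 2 and jam both copies into one j/k loop body."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    main = n - n % 2
    for i in range(0, main, 2):
        for j in range(p):
            c0 = c1 = 0  # accumulators for rows i and i+1
            for k in range(m):
                b = B[k][j]  # loaded once, reused by both rows
                c0 += A[i][k] * b
                c1 += A[i + 1][k] * b
            C[i][j], C[i + 1][j] = c0, c1
    for i in range(main, n):  # remainder row when n is odd
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]
print(matmul_unroll_jam(A, B))  # [[25, 28], [57, 64], [89, 100]]
```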
A study of scalar compilation techniques for pipelined supercomputers
TLDR
It is shown that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler.
Compiler transformations for high-performance computing
TLDR
This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages, such as C and Fortran, and describes the purpose of each transformation, how to determine if it is legal, and an example of its application.
Improving the ratio of memory operations to floating-point operations in loops
TLDR
This paper develops and evaluates techniques that automatically restructure program loops to achieve high performance on specific target architectures; the techniques attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock.
Automatic selection of high-order transformations in the IBM XL FORTRAN compilers
TLDR
This paper describes how the transformer component of the ASTI optimizer automatically selects high-order transformations for a given input program and a target uniprocessor, so as to improve utilization of the memory hierarchy and instruction-level parallelism.
Improving register allocation for subscripted variables
TLDR
This paper presents a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers.
Memory bandwidth optimizations for wide-bus machines
TLDR
The authors describe and evaluate the effectiveness of some code improvement techniques that are designed to take advantage of wide-bus machines, and show that, for many memory-insensitive algorithms, it is possible to reduce the number of memory loads and stores by 30 to 40%.
Software methods for improvement of cache performance on supercomputer applications
TLDR
Measurements of actual supercomputer cache performance have not been previously undertaken; PFC-Sim, a program-driven event tracing facility that can simulate data cache performance of very long programs, is used to measure the performance of various cache structures.