Memory-Constrained Data Locality Optimization for Tensor Contractions

@inproceedings{Bibireata2003MemoryConstrainedDL,
  title={Memory-Constrained Data Locality Optimization for Tensor Contractions},
  author={Alina Bibireata and Sandhya Krishnan and Gerald Baumgartner and Daniel Cociorva and Chi-Chung Lam and P. Sadayappan and J. Ramanujam and David E. Bernholdt and Venkatesh Choppella},
  booktitle={LCPC},
  year={2003}
}
The accurate modeling of the electronic structure of atoms and molecules involves computationally intensive tensor contractions over large multi-dimensional arrays. Efficient computation of these contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, requiring their storage on disk. However, the intermediates can often be generated and used in batches through appropriate loop fusion transformations. To optimize the… 
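The effect of loop fusion on intermediate storage can be illustrated with a minimal Python sketch. This is not the paper's algorithm or cost model. It is a toy two-step contraction, T[i][j] = sum_k A[i][k] * B[k][j] followed by S[i] = sum_j T[i][j] * C[j], in which the function names, the array shapes, and the returned element count are all assumptions made for illustration. Fusing the shared `i` loop lets the intermediate shrink from a full two-dimensional array to a single row.

```python
# Illustrative sketch only: names, shapes, and the "elements held" count
# are assumptions, not taken from the paper.

def contract_unfused(A, B, C):
    """Materialize the whole intermediate T[i][j] before consuming it."""
    ni, nk, nj = len(A), len(B), len(C)
    # Producer loop nest: build the full ni x nj intermediate.
    T = [[sum(A[i][k] * B[k][j] for k in range(nk)) for j in range(nj)]
         for i in range(ni)]
    # Consumer loop nest: reduce T against C.
    S = [sum(T[i][j] * C[j] for j in range(nj)) for i in range(ni)]
    return S, ni * nj  # result, intermediate elements held at once


def contract_fused(A, B, C):
    """Fuse the i loop of producer and consumer: T shrinks to one row."""
    ni, nk, nj = len(A), len(B), len(C)
    S = []
    for i in range(ni):
        # Only the current row of the intermediate is alive.
        T_row = [sum(A[i][k] * B[k][j] for k in range(nk)) for j in range(nj)]
        S.append(sum(T_row[j] * C[j] for j in range(nj)))
    return S, nj  # result, intermediate elements held at once


A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]     # 2 x 3
C = [1, 1, 1]                  # length 3

S_unfused, held_unfused = contract_unfused(A, B, C)
S_fused, held_fused = contract_fused(A, B, C)
```

Both versions compute the same result and perform the same arithmetic. Only the lifetime, and therefore the size, of the intermediate changes, which is the property that makes it possible to keep batches of a large intermediate in memory instead of writing the whole array to disk.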
Model-driven search-based loop fusion optimization for handwritten code
TLDR
This thesis shows how to apply the loop fusion algorithm to handwritten code in a procedural language and outlines how the constraints on loop bounds expressions and array index expressions could be removed in the future using an algebraic cost model and an analysis of the iteration space using a polyhedral model.
MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation
TLDR
This article proposes MemHC, an optimized systematic GPU memory management framework that aims to accelerate the calculation of many-body correlation functions utilizing a series of new memory reduction designs.
Automatic code generation for many-body electronic structure methods: the tensor contraction engine
TLDR
An overview of the Tensor Contraction Engine (TCE), a unique effort to address issues of both productivity and performance through automatic code generation that acts like an optimizing compiler.
Automatic transformation and optimization of applications on GPUs and GPU clusters
TLDR
An auto-tuning framework that selects algorithms and parameters according to a cost model and thresholds extracted from simple micro-benchmarks is developed, along with a loop transformation system for multi-level memory hierarchies.
Out-of-Core Computations of High-Resolution Level Sets by Means of Code Transformation
TLDR
A storage efficient, fast and parallelizable out-of-core framework for streaming computations of high resolution level sets which allows for the combination of interface propagation, re-normalization and narrow-band rebuild into a single pass over the data stored on disk.
Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models
TLDR
This paper provides an overview of a program synthesis system for a class of quantum chemistry computations that are expressible as a set of tensor contractions and arise in electronic structure modeling.
A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry
This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as
Symbolic Algebra in Quantum Chemistry
TLDR
New algorithms that automate the algebraic transformation and computer implementation of many-body quantum-mechanical methods for electron correlation enable a whole new class of highly complex but vastly accurate methods.

References

SHOWING 1-10 OF 19 REFERENCES
Global communication optimization for tensor contraction expressions under memory constraints
TLDR
An approach to identify the best combination of loop fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit is developed.
Loop optimization for a class of memory-constrained computations
TLDR
This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays, with the objective of minimizing cache misses while keeping the total memory usage within a given limit.
Space-time trade-off optimization for a class of electronic structure calculations
TLDR
An algorithm is presented that starts with an operation-minimal form of the computation and systematically explores the possible space-time trade-offs to identify the form with lowest cost that fits within a specified memory limit.
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
TLDR
This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures.
Data Locality Optimization for Synthesis of Efficient Out-of-Core Algorithms
TLDR
This paper describes an approach to the synthesis of efficient out-of-core code for a class of imperfectly nested loops that represent tensor contraction computations. The approach combines loop fusion with loop tiling and uses a performance-model-driven approach to loop tiling for the generation of out-of-core code.
On Optimizing a Class of Multi-Dimensional Loops with Reductions for Parallel Execution
TLDR
This paper addresses the compile-time optimization of a form of nested-loop computation that is motivated by a computational physics application and a pruning search strategy for determination of an optimal form is developed.
Optimization of Memory Usage and Communication Requirements for a Class of Loops Implementing Multi-Dimensional Integrals
TLDR
This paper proposes algorithms for finding loop fusion configurations that minimize memory usage under static and dynamic memory allocation models, and suggests ways to further reduce memory usage, when necessary, at the cost of increased arithmetic operations.
Performance optimization of a class of loops implementing multidimensional integrals
TLDR
This thesis addresses the performance optimization of a class of loops that implement multi-dimensional summations and enhances the solutions to the various optimization problems to address the practically significant issues of sparsity, use of fast Fourier transforms, and utilization of common sub-expressions.
Optimization of Memory Usage Requirement for a Class of Loops Implementing Multi-dimensional Integrals
TLDR
This paper proposes an algorithm for finding a loop fusion configuration that minimizes memory usage and shows the performance improvement obtained by the algorithm on an electronic structure computation.
Optimization of a Class of Multi-Dimensional Integrals on Parallel Machines
TLDR
A framework for optimization of computational cost and communication cost has been developed, that can be used to synthesize efficient code.