Integrated compiler optimizations for tensor contractions
@inproceedings{Sadayappan2008IntegratedCO,
  title={Integrated compiler optimizations for tensor contractions},
  author={P. Sadayappan and Xiaoyang Gao},
  year={2008}
}
This dissertation addresses several performance optimization issues in the context of the Tensor Contraction Engine (TCE), a domain-specific compiler to synthesize parallel, out-of-core programs for a class of scientific computations encountered in computational chemistry and physics. The domain of our focus is electronic structure calculations, where many computationally intensive components are expressible as a set of tensor contractions. These scientific applications are extremely compute…
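For readers unfamiliar with the term, a tensor contraction sums products of tensor elements over one or more shared indices. The following is a minimal sketch in plain Python, not code produced by the TCE; the function name `contract`, the index layout, and the toy data are assumptions chosen purely for illustration.

```python
from itertools import product


def contract(a, b):
    """Return C with C[i][j] = sum over k, l of A[i][k][l] * B[k][l][j].

    A is indexed [i][k][l] and B is indexed [k][l][j]; k and l are the
    contracted (summation) indices, i and j are the free indices.
    """
    n_i, n_k, n_l = len(a), len(a[0]), len(a[0][0])
    n_j = len(b[0][0])
    c = [[0.0] * n_j for _ in range(n_i)]
    # One multiply-add per point of the (i, j, k, l) iteration space.
    for i, j, k, l in product(range(n_i), range(n_j), range(n_k), range(n_l)):
        c[i][j] += a[i][k][l] * b[k][l][j]
    return c


# Hypothetical toy data: A has shape 2x2x2, B has shape 2x2x1.
A = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
B = [[[1.0], [0.0]], [[0.0], [1.0]]]
C = contract(A, B)
```

The loop nest performs one multiply-add for every combination of free and contracted indices, so its cost grows with the product of all index ranges. This is why loop order, fusion, tiling, and data placement matter at realistic problem sizes.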
One Citation
Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions
- Computer Science, J. Parallel Distributed Comput.
- 2012
References
Global communication optimization for tensor contraction expressions under memory constraints
- Computer Science, Proceedings International Parallel and Distributed Processing Symposium
- 2003
An approach to identify the best combination of loop fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit is developed.
Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
- Computer Science, HiPC
- 2001
This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures.
Compiler Techniques for Efficient Parallelization of Out-of-Core Tensor Contractions
- Computer Science
- 2005
A performance model for tensor contractions is developed, considering both disk I/O as well as inter-processor communication costs, to facilitate performance-model driven loop optimization for this domain.
Integrated Loop Optimizations for Data Locality Enhancement of Tensor Contraction Expressions
- Computer Science, ACM/IEEE SC 2005 Conference (SC'05)
- 2005
Novel pruning strategies are developed whereby a search problem in a larger space is replaced by a large number of searches in a much smaller space, to determine the optimal permutation, fusion, tiling and placement of disk I/O statements.
Data Locality Optimization for Synthesis of Efficient Out-of-Core Algorithms
- Computer Science, HiPC
- 2003
This paper describes an approach to the synthesis of efficient out-of-core code for a class of imperfectly nested loops that represent tensor contraction computations. The approach combines loop fusion with loop tiling and uses a performance-model-driven approach to loop tiling for the generation of out-of-core code.
Compilation Techniques for Out-of-Core Parallel Computations
- Computer Science, Parallel Comput.
- 1998
Automated Operation Minimization of Tensor Contraction Expressions in Electronic Structure Calculations
- Computer Science, International Conference on Computational Science
- 2005
This paper develops an effective heuristic approach to the operation minimization problem, and demonstrates its effectiveness on tensor contraction expressions for coupled cluster equations.
Loop optimization for a class of memory-constrained computations
- Computer Science, ICS '01
- 2001
This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays, with the objective of minimizing cache misses while keeping the total memory usage within a given limit.
Compiler support for out-of-core arrays on parallel machines
- Computer Science, Proceedings Frontiers '95. The Fifth Symposium on the Frontiers of Massively Parallel Computation
- 1995
In general, the compiler techniques attempt to choreograph I/O for an application based on high-level programmer annotations similar to Fortran D's DECOMPOSITION, ALIGN, and DISTRIBUTE statements.
A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations
- Computer Science, IEEE Trans. Parallel Distributed Syst.
- 2000
A unified framework is presented that optimizes out-of-core programs by exploiting locality and parallelism and by reducing communication overhead. The base algorithm is extended to work with file layout constraints and is shown to be useful for optimizing programs that consist of multiple loop nests.