Automatic code generation for many-body electronic structure methods: the tensor contraction engine

@article{Auer2006AutomaticCG,
  title={Automatic code generation for many-body electronic structure methods: the tensor contraction engine},
  author={Alexander A. Auer and Gerald Baumgartner and David E. Bernholdt and Alina Bibireata and Venkatesh Choppella and Daniel Cociorva and Xiaoyang Gao and Robert J. Harrison and Sriram Krishnamoorthy and Sandhya Krishnan and Chi-Chung Lam and Qingda Lu and Marcel Nooijen and Russell M. Pitzer and J. Ramanujam and P. Sadayappan and Alexander Sibiryakov},
  journal={Molecular Physics},
  year={2006},
  volume={104},
  pages={211--228}
}
As both electronic structure methods and the computers on which they are run become increasingly complex, the task of producing robust, reliable, high-performance implementations of methods at a rapid pace becomes increasingly daunting. In this paper we present an overview of the Tensor Contraction Engine (TCE), a unique effort to address issues of both productivity and performance through automatic code generation. The TCE is designed to take equations for many-body methods in a convenient… 
Generating Efficient Quantum Chemistry Codes for Novel Architectures.
TLDR
It is suggested that the meta-programming and empirical performance optimization approach may be important in future computational chemistry applications, especially in the face of quickly evolving computer architectures.
A Code Generator for High-Performance Tensor Contractions on GPUs
TLDR
A high-performance GPU code generator for arbitrary tensor contractions that exploits domain-specific properties about data reuse in tensor contractions to devise an effective code generation schema and determine parameters for mapping of computation to threads and staging of data through the GPU memory hierarchy.
A case study in mechanically deriving dense linear algebra code
TLDR
This paper uses DxT to derive the implementation of a representative matrix operation, two-sided Trmm, using a knowledge base of transformations that were encoded for a simpler set of operations, the level-3 BLAS, and adding only a few transformations to accommodate the more complex two-sided Trmm.
Format abstraction for sparse tensor algebra compilers
TLDR
An interface that describes formats in terms of their capabilities and properties is developed, and a modular code generator design makes it simple to add support for new tensor formats, and the performance of the generated code is competitive with hand-optimized implementations.
Taco: A tool to generate tensor algebra kernels
TLDR
Tensor algebra is an important computational abstraction that is increasingly used in data analytics, machine learning, engineering, and the physical sciences; to support programmers, the authors have developed taco, a code generation tool that generates dense, sparse, and mixed kernels from tensor algebra expressions.
Expression Tree Evaluation by Dynamic Code Generation - Are Accelerators Up for the Task?
TLDR
The authors see a need for coming HPC systems to still be equipped with a significant portion of latency-oriented, and thus complex, general-purpose hardware, and they investigate the benefit of accelerators for this scenario.
Optimizing tensor contraction expressions for hybrid CPU-GPU execution
TLDR
This paper presents the approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU, and provides several effective optimization algorithms.
The tensor algebra compiler
TLDR
The first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors is introduced, which is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.
AutoHOOT: Automatic High-Order Optimization for Tensors
TLDR
This work introduces AutoHOOT, the first automatic differentiation framework targeting high-order optimization for tensor computations, which contains a new explicit Jacobian / Hessian expression generation kernel whose outputs maintain the input tensors' granularity and are easy to optimize.
Generatively Programming Galerkin Projections on General Purpose Graphics Processing Units
TLDR
A performance improvement of almost an order of magnitude over a multicore CPU implementation for the Advection-Diffusion equation on typical hardware performing computations using double-precision arithmetic is demonstrated.

References

Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization
TLDR
This paper provides an overview of a planned synthesis system that will take as input a high-level specification of the computation and generate high-performance parallel code for a number of target architectures.
Space-time trade-off optimization for a class of electronic structure calculations
TLDR
An algorithm is presented that starts with an operation-minimal form of the computation and systematically explores the possible space-time trade-offs to identify the form with lowest cost that fits within a specified memory limit.
Raising the Level of Programming Abstraction in Scalable Programming Models
TLDR
This paper presents two distinctly different approaches to raising the level of abstraction of the programming model while maintaining or increasing performance: the Tensor Contraction Engine, a narrowly focused domain-specific language together with an optimizing compiler; and Extended Global Arrays, a programming framework that integrates programming models dealing with different layers of the memory/storage hierarchy using compiler analysis and code transformation techniques.
Memory-Constrained Data Locality Optimization for Tensor Contractions
TLDR
An optimization framework to search among a space of fusion and tiling choices to minimize the data movement overhead is developed and is demonstrated on a computation representative of a component used in quantum chemistry suites.
The automated solution of second quantization equations with applications to the coupled cluster approach
TLDR
In this research a program has been written in the C programming language which can efficiently compute the quasivacuum expectation value of a product of creation and annihilation operators and scalar arrays and which has been applied to open-shell coupled cluster theory.
Memory-Constrained Communication Minimization for a Class of Array Computations
TLDR
An approach to identify the best combination of loop fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit is developed.
Data Locality Optimization for Synthesis of Efficient Out-of-Core Algorithms
TLDR
This paper describes an approach to the synthesis of efficient out-of-core code for a class of imperfectly nested loops that represent tensor contraction computations; the approach combines loop fusion with loop tiling and uses a performance-model-driven approach to loop tiling for the generation of out-of-core code.
On Optimizing a Class of Multi-Dimensional Loops with Reductions for Parallel Execution
TLDR
This paper addresses the compile-time optimization of a form of nested-loop computation that is motivated by a computational physics application, and a pruning search strategy for determination of an optimal form is developed.
Global arrays: A nonuniform memory access programming model for high-performance computers
TLDR
The key concept of GAs is that they provide a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes.
Loop optimization for a class of memory-constrained computations
TLDR
This paper develops an integrated model combining loop tiling for enhancing data reuse, and loop fusion for reduction of memory for intermediate temporary arrays, with the objective of minimizing cache misses while keeping the total memory usage within a given limit.