# HPTT: a high-performance tensor transposition C++ library

```bibtex
@article{Springer2017HPTTAH,
  title   = {HPTT: a high-performance tensor transposition C++ library},
  author  = {Paul L. Springer and Tong Su and Paolo Bientinesi},
  journal = {Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming},
  year    = {2017}
}
```

Recently we presented TTC, a domain-specific compiler for tensor transpositions. […] This modular design, inspired by BLIS, makes HPTT easy to port to different architectures: only the hand-vectorized micro-kernel (e.g., a 4×4 transpose) needs to be replaced. HPTT also offers an optional autotuning framework, guided by performance heuristics, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e…
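The abstract describes porting HPTT by replacing only a hand-vectorized micro-kernel such as a 4×4 transpose. As a rough illustration of the shape of that building block (not HPTT's actual code, which is vectorized with SIMD shuffles), a scalar version might look like:

```cpp
#include <cstddef>

// Illustrative sketch only: HPTT's real micro-kernels are hand-vectorized;
// this scalar 4x4 transpose merely shows the interface of the building block
// that a port to a new architecture would swap out.
constexpr std::size_t kBlock = 4;

// Transpose a 4x4 tile, out[j][i] = alpha * in[i][j], where lda and ldb are
// the leading dimensions (row strides) of the source and destination.
void transpose4x4(const float* in, std::size_t lda,
                  float* out, std::size_t ldb, float alpha = 1.0f) {
  for (std::size_t i = 0; i < kBlock; ++i)
    for (std::size_t j = 0; j < kBlock; ++j)
      out[j * ldb + i] = alpha * in[i * lda + j];
}
```

A real port would implement the same interface with the target's SIMD shuffles (e.g., the `_MM_TRANSPOSE4_PS` macro on x86), which is exactly the piece that HPTT's BLIS-style layering isolates.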

## 39 Citations

### TTC: A high-performance Compiler for Tensor Transpositions

- Computer Science, ACM Trans. Math. Softw.
- 2017

By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large.
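To make the pruning idea concrete: a d-dimensional transposition admits d! candidate loop orders, so a compiler can cap how many candidates it actually evaluates. The sketch below is a hypothetical illustration of that cap; the function name and the `maxCandidates` knob are invented here, not TTC's interface.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Enumerate loop-order permutations for a dim-dimensional transposition,
// stopping once maxCandidates have been collected. A real compiler would
// score each candidate (stride analysis, heuristics) instead of keeping all.
std::vector<std::vector<int>> candidateLoopOrders(int dim,
                                                  std::size_t maxCandidates) {
  std::vector<int> order(dim);
  for (int i = 0; i < dim; ++i) order[i] = i;  // identity permutation
  std::vector<std::vector<int>> candidates;
  do {
    candidates.push_back(order);
  } while (candidates.size() < maxCandidates &&
           std::next_permutation(order.begin(), order.end()));
  return candidates;
}
```

For a 4-dimensional tensor this yields 24 loop orders without a cap; a cap of, say, 10 keeps the search tractable as the dimensionality (and hence d!) grows.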

### Design of a High-Performance GEMM-like Tensor–Tensor Multiplication

- Computer Science, ACM Trans. Math. Softw.
- 2018

GETT is a novel approach for dense tensor contractions that mirrors the design of a high-performance general matrix–matrix multiplication (GEMM), and exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory.

### Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs

- Computer Science, ICS
- 2018

This paper develops an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion and register transpose, and demonstrates significant improvement over the current state-of-the-art.

### TTLG - An Efficient Tensor Transposition Library for GPUs

- Computer Science, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2018

This paper presents a Tensor Transposition Library for GPUs (TTLG). A distinguishing feature of TTLG is that it also includes a performance prediction model, which can be used by higher level…

### An optimized tensor completion library for multiple GPUs

- Computer Science, ICS
- 2021

The first tensor completion library, cuTC, is developed on multiple Graphics Processing Units (GPUs) with three widely used optimization algorithms, alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD+), together with a novel TB-COO format that leverages warp shuffle and shared memory on the GPU to enable efficient reduction.

### Towards compositional and generative tensor optimizations

- Computer Science, SPLASH 2017
- 2017

This paper proposes a generic and easily extensible intermediate language for expressing tensor computations and code transformations in a modular and generative fashion and offers meta-programming capabilities for experts in code optimization.

### Enabling Distributed-Memory Tensor Completion in Python using New Sparse Tensor Kernels

- Computer Science, ArXiv
- 2019

A new multi-tensor routine, TTTP, is introduced that is asymptotically more efficient than pairwise tensor contraction for key components of the tensor completion methods.

### Distributed-memory tensor completion for generalized loss functions in python using new sparse tensor kernels

- Computer Science, J. Parallel Distributed Comput.
- 2022

### AutoHOOT: Automatic High-Order Optimization for Tensors

- Computer Science, PACT
- 2020

This work introduces AutoHOOT, the first automatic differentiation framework targeting high-order optimization for tensor computations, which contains a new explicit Jacobian/Hessian expression generation kernel whose outputs maintain the input tensors' granularity and are easy to optimize.

### A Code Generator for High-Performance Tensor Contractions on GPUs

- Computer Science, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
- 2019

A high-performance GPU code generator for arbitrary tensor contractions that exploits domain-specific properties about data reuse in tensor contractions to devise an effective code generation schema and determine parameters for the mapping of computation to threads and the staging of data through the GPU memory hierarchy.

## References

Showing 1–10 of 30 references.

### TTC: A high-performance Compiler for Tensor Transpositions

- Computer Science, ACM Trans. Math. Softw.
- 2017

By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large.

### TTC: a tensor transposition compiler for multiple architectures

- Computer Science, ARRAY@PLDI
- 2016

The results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly, and TTC's support for multiple leading dimensions makes it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.

### cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs

- Computer Science, ArXiv
- 2017

A heuristic scheme for choosing the optimal parameters for tensor transpose algorithms is developed by implementing an analytical GPU performance model that can be used at runtime without the need for performance measurements or profiling.

### Design of a High-Performance GEMM-like Tensor–Tensor Multiplication

- Computer Science, ACM Trans. Math. Softw.
- 2018

GETT is a novel approach for dense tensor contractions that mirrors the design of a high-performance general matrix–matrix multiplication (GEMM), and exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory.

### Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions

- Computer Science, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing
- 2013

This work demonstrates performance of CC with single and double excitations on 8192 nodes of Blue Gene/Q and shows that CTF outperforms NWChem on Cray XE6 supercomputers for benchmarked systems.

### An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

- Computer Science, Comput. Phys. Commun.
- 2015

### Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors

- Computer Science
- 2013

This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition that allows the same C code to be used for both a CPU executable and a MIC-architecture executable, with high efficiency in both cases.
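For context, the baseline operation that paper optimizes, in-place transposition of a square matrix, fits in a few lines; the paper's contribution lies in tiling and parallelizing it with one code path for Xeon and Xeon Phi, which this minimal sketch deliberately omits:

```cpp
#include <cstddef>
#include <utility>

// Minimal in-place transposition of an n x n row-major matrix: swap each
// element above the diagonal with its mirror below it. No tiling, no
// threading; purely the baseline operation being optimized.
void transposeInPlace(float* a, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = i + 1; j < n; ++j)
      std::swap(a[i * n + j], a[j * n + i]);
}
```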

### Combining analytical and empirical approaches in tuning matrix transposition

- Computer Science, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT)
- 2006

An integrated optimization framework is developed that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors.

### Cache-efficient matrix transposition

- Computer Science, Proceedings Sixth International Symposium on High-Performance Computer Architecture, HPCA-6 (Cat. No. PR00550)
- 2000

This work investigates the memory system performance of several algorithms for transposing an N×N matrix in-place, where N is large, and quantifies the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time.

### BLIS: A Framework for Rapidly Instantiating BLAS Functionality

- Computer Science, ACM Trans. Math. Softw.
- 2015

Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).