HPTT: a high-performance tensor transposition C++ library

@inproceedings{Springer2017HPTTAH,
  title={{HPTT}: a high-performance tensor transposition {C++} library},
  author={Paul L. Springer and Tong Su and Paolo Bientinesi},
  booktitle={Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming},
  year={2017}
}
  • Published 14 April 2017
Recently we presented TTC, a domain-specific compiler for tensor transpositions. […] This modular design, inspired by BLIS, makes HPTT easy to port to different architectures by replacing only the hand-vectorized micro-kernel (e.g., a 4×4 transpose). HPTT also offers an optional autotuning framework, guided by performance heuristics, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures […]


TTC: A high-performance Compiler for Tensor Transpositions

By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large.

Design of a High-Performance GEMM-like Tensor–Tensor Multiplication

GETT is a novel approach for dense tensor contractions that mirrors the design of a high-performance general matrix–matrix multiplication (GEMM), and exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory.

Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs

This paper develops an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion and register transpose, and demonstrates significant improvement over the current state-of-the-art.

TTLG - An Efficient Tensor Transposition Library for GPUs

This paper presents a Tensor Transposition Library for GPUs (TTLG). A distinguishing feature of TTLG is that it also includes a performance prediction model, which can be used by higher-level […]

An optimized tensor completion library for multiple GPUs

cuTC, the first tensor completion library for multiple Graphics Processing Units (GPUs), is developed with three widely used optimization algorithms, alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD+), together with a novel TB-COO format that leverages warp shuffle and shared memory on the GPU to enable efficient reduction.

Towards compositional and generative tensor optimizations

This paper proposes a generic and easily extensible intermediate language for expressing tensor computations and code transformations in a modular and generative fashion and offers meta-programming capabilities for experts in code optimization.

Enabling Distributed-Memory Tensor Completion in Python using New Sparse Tensor Kernels

A new multi-tensor routine, TTTP, is introduced that is asymptotically more efficient than pairwise tensor contraction for key components of the tensor completion methods.

AutoHOOT: Automatic High-Order Optimization for Tensors

This work introduces AutoHOOT, the first automatic differentiation framework targeting at high-order optimization for tensor computations, which contains a new explicit Jacobian / Hessian expression generation kernel whose outputs maintain the input tensors' granularity and are easy to optimize.

A Code Generator for High-Performance Tensor Contractions on GPUs

A high-performance GPU code generator for arbitrary tensor contractions that exploits domain-specific properties about data reuse in tensor contractions to devise an effective code-generation schema and to determine parameters for mapping computation to threads and staging data through the GPU memory hierarchy.

References

SHOWING 1-10 OF 30 REFERENCES

TTC: A high-performance Compiler for Tensor Transpositions

By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large.

TTC: a tensor transposition compiler for multiple architectures

The results suggest that a domain-specific compiler can significantly outperform its general-purpose counterpart, and TTC's support for multiple leading dimensions makes it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.

cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs

A heuristic scheme for choosing optimal parameters for tensor transpose algorithms is developed by implementing an analytical GPU performance model that can be used at runtime without the need for performance measurements or profiling.

Design of a High-Performance GEMM-like Tensor–Tensor Multiplication

GETT is a novel approach for dense tensor contractions that mirrors the design of a high-performance general matrix–matrix multiplication (GEMM), and exhibits desirable features such as unit-stride memory accesses, cache-awareness, as well as full vectorization, without requiring auxiliary memory.

Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions

This work demonstrates the performance of coupled cluster (CC) with single and double excitations on 8192 nodes of Blue Gene/Q and shows that CTF outperforms NWChem on Cray XE6 supercomputers for the benchmarked systems.

Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors

This paper demonstrates and discusses an efficient C-language implementation of parallel in-place square matrix transposition that allows the same C code to be used for both a CPU and a MIC (Intel Xeon Phi) architecture executable, each demonstrating high efficiency.

Combining analytical and empirical approaches in tuning matrix transposition

An integrated optimization framework is developed that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors.

Cache-efficient matrix transposition

  • S. Chatterjee, Sandeep Sen
  • Computer Science
    Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)
  • 2000
This work investigates the memory-system performance of several algorithms for transposing an N×N matrix in place, where N is large, and quantifies the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time.

BLIS: A Framework for Rapidly Instantiating BLAS Functionality

Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).