HPTT: a high-performance tensor transposition C++ library

@inproceedings{springer2017hptt,
  title={HPTT: a high-performance tensor transposition C++ library},
  author={Paul L. Springer and Tong Su and Paolo Bientinesi},
  booktitle={Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming},
  year={2017}
}

  • Published 14 April 2017
  • Computer Science
Recently we presented TTC, a domain-specific compiler for tensor transpositions. […] This modular design—inspired by BLIS—makes HPTT easy to port to different architectures, by only replacing the hand-vectorized micro-kernel (e.g., a 4×4 transpose). HPTT also offers an optional autotuning framework—guided by performance heuristics—that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e…
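The micro-kernel design described above can be illustrated with a minimal sketch (this is illustrative code, not HPTT's actual implementation): a small 4×4 transpose kernel, written here as plain scalar C++ rather than hand-vectorized intrinsics, serving as the building block of a blocked out-of-place transpose. All names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a blocked transpose built around a 4x4 micro-kernel.
// In HPTT the inner kernel is hand-vectorized per architecture;
// here it is plain scalar code for clarity.
static void transpose4x4(const float* a, std::size_t lda,
                         float* b, std::size_t ldb) {
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            b[j * ldb + i] = a[i * lda + j];  // B = A^T within the block
}

// Out-of-place transpose of an n x n matrix (n assumed divisible by 4).
// The outer loops walk over 4x4 blocks; only the micro-kernel would
// need to be replaced to port this scheme to a new SIMD architecture.
void blockedTranspose(const float* a, float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4)
        for (std::size_t j = 0; j < n; j += 4)
            transpose4x4(&a[i * n + j], n, &b[j * n + i], n);
}
```

Porting then amounts to swapping `transpose4x4` for an intrinsics-based version (e.g., using SSE shuffles) while the blocking logic stays unchanged.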

Figures and Tables from this paper

Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs

This paper develops an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion and register transpose, and demonstrates significant improvement over the current state-of-the-art.

TTLG - An Efficient Tensor Transposition Library for GPUs

This paper presents a Tensor Transposition Library for GPUs (TTLG). A distinguishing feature of TTLG is that it also includes a performance prediction model, which can be used by higher level

An optimized tensor completion library for multiple GPUs

The first tensor completion library, cuTC, is developed for multiple Graphics Processing Units (GPUs) with three widely used optimization algorithms, namely alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD+), together with a novel TB-COO format that leverages warp shuffle and shared memory on the GPU to enable efficient reduction.

Towards compositional and generative tensor optimizations

This paper proposes a generic and easily extensible intermediate language for expressing tensor computations and code transformations in a modular and generative fashion and offers meta-programming capabilities for experts in code optimization.

Enabling Distributed-Memory Tensor Completion in Python using New Sparse Tensor Kernels

A new multi-tensor routine, TTTP, is introduced that is asymptotically more efficient than pairwise tensor contraction for key components of the tensor completion methods.

AutoHOOT: Automatic High-Order Optimization for Tensors

This work introduces AutoHOOT, the first automatic differentiation framework targeting high-order optimization for tensor computations, which contains a new explicit Jacobian/Hessian expression generation kernel whose outputs maintain the input tensors' granularity and are easy to optimize.

A Code Generator for High-Performance Tensor Contractions on GPUs

A high-performance GPU code generator for arbitrary tensor contractions that exploits domain-specific properties about data reuse in tensor contractions to devise an effective code generation schema and determine parameters for mapping of computation to threads and staging of data through the GPU memory hierarchy.

Analytical cache modeling and tilesize optimization for tensor contractions

This paper provides an analytical-model-based approach to multi-level tile size optimization and permutation selection for tensor contractions and shows that this approach achieves comparable or better performance than state-of-the-art frameworks and libraries for tensor contractions.

a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs

A mode-wise flexible Tucker decomposition algorithm is proposed to enable switching between different solvers for the factor matrices and core tensor, and a machine-learning adaptive solver selector is applied to automatically cope with variations in both the input data and the hardware.

TTC: a tensor transposition compiler for multiple architectures

The results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly, and TTC's support for multiple leading dimensions makes it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.

Autotuning Tensor Transposition

  • Lai Wei, J. Mellor-Crummey
  • Computer Science
    2014 IEEE International Parallel & Distributed Processing Symposium Workshops
  • 2014
This paper introduces a framework that uses static analysis and empirical autotuning to produce optimized parallel tensor transposition code for node architectures using a rule-based code generation and transformation system.
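The empirical-autotuning idea summarized above (and used by HPTT itself) can be sketched in a few lines: time a candidate implementation for each tuning parameter value and keep the fastest. This is a minimal illustration, not code from the paper; the tile-size candidates and function names are assumptions.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Tiled out-of-place transpose; `tile` is the tuning parameter.
static void tiledTranspose(const float* a, float* b, std::size_t n,
                           std::size_t tile) {
    for (std::size_t ii = 0; ii < n; ii += tile)
        for (std::size_t jj = 0; jj < n; jj += tile)
            for (std::size_t i = ii; i < ii + tile && i < n; ++i)
                for (std::size_t j = jj; j < jj + tile && j < n; ++j)
                    b[j * n + i] = a[i * n + j];
}

// Empirical autotuning sketch: measure each candidate tile size at
// runtime and return the fastest. Real autotuners prune this search
// with performance heuristics instead of timing every candidate.
std::size_t pickBestTile(const float* a, float* b, std::size_t n) {
    const std::size_t candidates[] = {8, 16, 32, 64};
    std::size_t best = candidates[0];
    double bestTime = 1e300;
    for (std::size_t t : candidates) {
        auto start = std::chrono::steady_clock::now();
        tiledTranspose(a, b, n, t);
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start;
        if (dt.count() < bestTime) { bestTime = dt.count(); best = t; }
    }
    return best;
}
```

In practice the measurement would be repeated and averaged to reduce timing noise, and the chosen variant cached for subsequent calls.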

Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions

This work demonstrates performance of CC with single and double excitations on 8192 nodes of Blue Gene/Q and shows that CTF outperforms NWChem on Cray XE6 supercomputers for benchmarked systems.

Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors

This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition that allows the same C code to be used for both a CPU and a MIC architecture executable, both demonstrating high efficiency.

Combining analytical and empirical approaches in tuning matrix transposition

An integrated optimization framework is developed that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors.

Cache-efficient matrix transposition

  • S. Chatterjee, Sandeep Sen
  • Computer Science
    Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)
  • 2000
This work investigates the memory system performance of several algorithms for transposing an N×N matrix in-place, where N is large, and the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time.

BLIS: A Framework for Rapidly Instantiating BLAS Functionality

Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.

A fast algorithm for transposing large multidimensional image data sets