HPTT: a high-performance tensor transposition C++ library
@inproceedings{Springer2017HPTTAH,
  title     = {HPTT: a high-performance tensor transposition C++ library},
  author    = {Paul L. Springer and Tong Su and Paolo Bientinesi},
  booktitle = {Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming},
  year      = {2017}
}
Recently we presented TTC, a domain-specific compiler for tensor transpositions. Key Method: This modular design, inspired by BLIS, makes HPTT easy to port to different architectures by only replacing the hand-vectorized micro-kernel (e.g., a 4×4 transpose). HPTT also offers an optional autotuning framework, guided by performance heuristics, that explores a vast search space of implementations at runtime (similar to FFTW). Across a wide range of different tensor transpositions and architectures (e…
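The micro-kernel design described in the abstract can be illustrated with a short sketch. This is a plain scalar stand-in for HPTT's hand-vectorized kernel; the function names and signatures are illustrative, not HPTT's actual API:

```cpp
#include <cstddef>

// Scalar 4x4 transpose micro-kernel: B[j][i] = alpha * A[i][j].
// In the BLIS-inspired design, porting to a new architecture means
// replacing only this inner kernel with hand-vectorized code (e.g.,
// AVX shuffles); the surrounding blocking logic stays portable.
inline void transpose4x4(const float* A, std::size_t lda,
                         float* B, std::size_t ldb, float alpha) {
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            B[j * ldb + i] = alpha * A[i * lda + j];
}

// Blocked out-of-place 2-D transpose built from the micro-kernel
// (n divisible by 4 for brevity).
void transpose(const float* A, float* B, std::size_t n, float alpha) {
    for (std::size_t ib = 0; ib < n; ib += 4)
        for (std::size_t jb = 0; jb < n; jb += 4)
            transpose4x4(A + ib * n + jb, n, B + jb * n + ib, n, alpha);
}
```

The autotuning framework mentioned above would, in this picture, search over blocking factors and loop orders around the fixed micro-kernel.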
40 Citations
Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs
- Computer ScienceICS
- 2018
This paper develops an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion and register transpose, and demonstrates significant improvement over the current state-of-the-art.
TTLG - An Efficient Tensor Transposition Library for GPUs
- Computer Science2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2018
This paper presents a Tensor Transposition Library for GPUs (TTLG). A distinguishing feature of TTLG is that it also includes a performance prediction model, which can be used by higher level…
An optimized tensor completion library for multiple GPUs
- Computer ScienceICS
- 2021
The first tensor completion library, cuTC, is developed for multiple Graphics Processing Units (GPUs) with three widely used optimization algorithms: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD+), along with a novel TB-COO format that leverages warp shuffle and shared memory on the GPU to enable efficient reduction.
Towards compositional and generative tensor optimizations
- Computer ScienceSPLASH 2017
- 2017
This paper proposes a generic and easily extensible intermediate language for expressing tensor computations and code transformations in a modular and generative fashion, offering meta-programming capabilities for experts in code optimization.
Enabling Distributed-Memory Tensor Completion in Python using New Sparse Tensor Kernels
- Computer ScienceArXiv
- 2019
A new multi-tensor routine, TTTP, is introduced that is asymptotically more efficient than pairwise tensor contraction for key components of the tensor completion methods.
Distributed-memory tensor completion for generalized loss functions in python using new sparse tensor kernels
- Computer ScienceJ. Parallel Distributed Comput.
- 2022
AutoHOOT: Automatic High-Order Optimization for Tensors
- Computer SciencePACT
- 2020
This work introduces AutoHOOT, the first automatic differentiation framework targeting high-order optimization for tensor computations, which contains a new explicit Jacobian/Hessian expression generation kernel whose outputs maintain the input tensors' granularity and are easy to optimize.
A Code Generator for High-Performance Tensor Contractions on GPUs
- Computer Science2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
- 2019
A high-performance GPU code generator for arbitrary tensor contractions that exploits domain-specific properties about data reuse in tensor contractions to devise an effective code generation schema and determine parameters for mapping of computation to threads and staging of data through the GPU memory hierarchy.
Analytical cache modeling and tilesize optimization for tensor contractions
- Computer ScienceSC
- 2019
This paper provides an analytical model-based approach to multi-level tile size optimization and permutation selection for tensor contractions and shows that this approach achieves comparable or better performance than state-of-the-art frameworks and libraries for tensor contractions.
a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs
- Computer ScienceArXiv
- 2020
A mode-wise flexible Tucker decomposition algorithm is proposed that enables switching among different solvers for the factor matrices and core tensor, and a machine-learning adaptive solver selector is applied to automatically cope with variations in both the input data and the hardware.
References
TTC: a tensor transposition compiler for multiple architectures
- Computer ScienceARRAY@PLDI
- 2016
The results suggest that a domain-specific compiler can significantly outperform its general-purpose counterpart, and TTC's support for multiple leading dimensions makes it a suitable candidate for the generation of performance-critical packing functions that are at the core of the ubiquitous BLAS 3 routines.
Autotuning Tensor Transposition
- Computer Science2014 IEEE International Parallel & Distributed Processing Symposium Workshops
- 2014
This paper introduces a framework that uses static analysis and empirical autotuning to produce optimized parallel tensor transposition code for node architectures using a rule-based code generation and transformation system.
Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions
- Computer Science2013 IEEE 27th International Symposium on Parallel and Distributed Processing
- 2013
This work demonstrates performance of CC with single and double excitations on 8192 nodes of Blue Gene/Q and shows that CTF outperforms NWChem on Cray XE6 supercomputers for benchmarked systems.
An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
- Computer ScienceComput. Phys. Commun.
- 2015
Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors
- Computer Science
- 2013
This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition that allows the same C code to be used for both a CPU and a MIC architecture executable, with both demonstrating high efficiency.
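The in-place transposition this reference describes reduces to symmetric element swaps above the diagonal. A minimal sketch of that core loop (illustrative only, not the paper's tuned, tiled implementation):

```cpp
#include <cstddef>
#include <utility>

// In-place transpose of an n x n row-major matrix: swap each element
// above the diagonal with its mirror below it. Production versions,
// like the one in the referenced paper, tile these loops for cache
// locality and parallelize/vectorize them with common code across
// Xeon and Xeon Phi targets.
void transposeInPlace(float* A, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            std::swap(A[i * n + j], A[j * n + i]);
}
```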
Combining analytical and empirical approaches in tuning matrix transposition
- Computer Science2006 International Conference on Parallel Architectures and Compilation Techniques (PACT)
- 2006
An integrated optimization framework is developed that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors.
Cache-efficient matrix transposition
- Computer ScienceProceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)
- 2000
This work investigates the memory system performance of several algorithms for transposing an N×N matrix in-place, where N is large, and measures the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time.
BLIS: A Framework for Rapidly Instantiating BLAS Functionality
- Computer ScienceACM Trans. Math. Softw.
- 2015
Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
- Computer ScienceArXiv
- 2016
The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.