Parallel Nonnegative CP Decomposition of Dense Tensors

Grey Ballard, Koby Hayashi, Ramakrishnan Kannan
2018 IEEE 25th International Conference on High Performance Computing (HiPC)
The CP tensor decomposition is a low-rank approximation of a tensor. […] The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmark our software on synthetic as well as hyperspectral image and neuroscience dynamic functional…
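
As a rough illustration of the reuse idea behind dimension trees, the following pure-Python sketch contracts a small 3-way tensor with one factor matrix once and then reuses that partial result for two different MTTKRPs. It is not the authors' implementation: every function name is hypothetical, the tensor is a tiny nested list, and nothing here is parallel or distributed.

```python
# Sketch only: all names below are hypothetical and chosen for illustration.

def mttkrp_mode0(X, B, C):
    """Direct mode-0 MTTKRP: M[i][r] = sum_{j,k} X[i][j][k] * B[j][r] * C[k][r]."""
    I, J, K, R = len(X), len(B), len(C), len(B[0])
    return [[sum(X[i][j][k] * B[j][r] * C[k][r]
                 for j in range(J) for k in range(K))
             for r in range(R)]
            for i in range(I)]

def partial_contract_mode2(X, C):
    """Shared partial result: T[i][j][r] = sum_k X[i][j][k] * C[k][r]."""
    R = len(C[0])
    return [[[sum(x * C[k][r] for k, x in enumerate(fiber)) for r in range(R)]
             for fiber in row]
            for row in X]

def mode0_from_partial(T, B):
    """Finish the mode-0 MTTKRP: M0[i][r] = sum_j T[i][j][r] * B[j][r]."""
    R = len(B[0])
    return [[sum(T_i[j][r] * B[j][r] for j in range(len(B))) for r in range(R)]
            for T_i in T]

def mode1_from_partial(T, A):
    """Finish the mode-1 MTTKRP: M1[j][r] = sum_i T[i][j][r] * A[i][r]."""
    J, R = len(T[0]), len(A[0])
    return [[sum(T[i][j][r] * A[i][r] for i in range(len(A))) for r in range(R)]
            for j in range(J)]

# Tiny 2 x 2 x 2 tensor with rank R = 2 (integer data, so comparisons are exact).
X = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
A = [[1, 0], [2, 1]]
B = [[1, 2], [0, 1]]
C = [[1, 1], [2, 0]]

T = partial_contract_mode2(X, C)   # the expensive contraction, computed once
M0 = mode0_from_partial(T, B)      # reused for the mode-0 MTTKRP
M1 = mode1_from_partial(T, A)      # reused for the mode-1 MTTKRP
```

In this toy setting the contraction with `C` touches every tensor entry once, and both MTTKRPs are then finished from the smaller intermediate `T` instead of revisiting the full tensor for each mode.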

Sparsity-Aware Tensor Decomposition

This paper considers a design space that covers whether partial MTTKRP results should be saved and which mode permutation to use, models the total volume of data movement to and from memory, and proposes a fine-grained load-balancing method that supports higher levels of parallelization.

Efficient parallel CP decomposition with pairwise perturbation and multi-sweep dimension tree

This paper introduces the multi-sweep dimension tree (MSDT) algorithm, which requires the contraction between an order N input tensor and the first-contracted input matrix once every $(N-1)/N$ sweeps, and introduces a more communication-efficient approach to parallelizing an approximate CP-ALS algorithm, pairwise perturbation.

PLANC: Parallel Low Rank Approximation with Non-negativity Constraints

This work proposes a distributed-memory parallel computing solution to handle massive data sets, loading the input data across the memories of multiple nodes and performing efficient and scalable parallel algorithms to compute the low-rank approximation.

On optimizing distributed non-negative Tucker decomposition

This work develops three algorithms for efficiently executing the non-negative Tucker decomposition (NTD) procedure and presents a distributed implementation of NTD for sparse tensors that scales well, with speedups of up to 12x, together with improved algorithms optimized for properties unique to the NTD procedure.

General Memory-Independent Lower Bound for MTTKRP

A lower bound is established on the communication required to perform the Matricized-Tensor Times Khatri-Rao Product (MTTKRP) computation on a distributed-memory parallel machine, tightening the bound so that it is attainable even when the tensor dimensions vary and when the number of processors is small relative to the tensor dimensions.

Alternating Mahalanobis Distance Minimization for Stable and Accurate CP Decomposition

A new formulation for deriving singular values and vectors of a tensor is introduced, based on the critical points of a function different from the one used in previous work. A subsweep of the resulting algorithm is shown to achieve a superlinear convergence rate for exact CPD with known rank, and this is verified experimentally.

Comparison of Accuracy and Scalability of Gauss-Newton and Alternating Least Squares for CP Decomposition

This work provides the first parallel implementation of a Gauss-Newton method for CP decomposition, which iteratively solves linear least squares problems at each Gauss-Newton step, and evaluates the performance of sequential and parallel versions of both approaches.

Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations

This work proposes a hardware accelerator that can accelerate both dense and sparse tensor factorizations and co-designs the hardware and a sparse storage format, which allows accessing the sparse data in vectorized and streaming fashion and maximizes the utilization of the memory bandwidth.

Accelerating alternating least squares for tensor decomposition by pairwise perturbation

A novel family of algorithms is introduced that uses perturbative corrections to the subproblems rather than recomputing the tensor contractions. The approximation is accurate when the factor matrices change little across iterations, which occurs as ALS approaches convergence.

Accelerated Stochastic Gradient for Nonnegative Tensor Completion and Parallel Implementation

A shared-memory implementation of the accelerated gradient algorithm is developed using the multithreaded API OpenMP, which attains significant speedup and is believed to be a very competitive candidate for the solution of very large nonnegative tensor completion problems.



Parallel Candecomp/Parafac Decomposition of Sparse Tensors Using Dimension Trees

A novel computational scheme is proposed for reducing the cost of a core operation in computing the CP decomposition with the traditional alternating least squares (CP-ALS) algorithm, and the scheme is effectively parallelized in the context of CP-ALS in shared- and distributed-memory environments.

Model-Driven Sparse CP Decomposition for Higher-Order Tensors

A novel, adaptive tensor memoization algorithm, AdaTM, is presented. It allows a user to make a space-time tradeoff by automatically tuning algorithmic and machine parameters using a model-driven framework, making its performance more scalable for higher-order data problems.

High Performance Parallel Algorithms for the Tucker Decomposition of Sparse Tensors

  • O. Kaya, B. Uçar
  • Computer Science
  • 2016 45th International Conference on Parallel Processing (ICPP)
  • 2016
A set of preprocessing steps is discussed that takes all computational decisions out of the main iteration of the algorithm and provides an intuitive shared-memory parallelism for the TTM and TRSVD steps.

SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication

Multi-dimensional arrays, or tensors, are increasingly found in fields such as signal processing and recommender systems. Real-world tensors can be enormous in size and often very sparse. There is a…

A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization

A medium-grained decomposition of the tensor nonzeros is presented that avoids complete factor replication and communication while eliminating the need for expensive pre-processing steps. It uses a hybrid MPI+OpenMP implementation that exploits multi-core architectures with a low memory footprint.

High Performance Parallel Algorithms for Tensor Decompositions

The main focus of this thesis is on efficient decomposition of high-dimensional sparse tensors, with hundreds of millions to billions of nonzero entries, which arise in many emerging big data applications. It introduces a tree-based computational scheme that carries out expensive operations faster by factoring out and storing common partial results and effectively re-using them.

PARAFAC algorithms for large-scale problems

Communication Lower Bounds for Matricized Tensor Times Khatri-Rao Product

This work establishes communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors and shows that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation.

Nesterov-based parallel algorithm for large-scale nonnegative tensor factorization

It turns out that the attained speedup is significant, rendering the algorithm a competitive candidate for the solution of very large-scale dense nonnegative tensor factorization problems.

Nesterov-Based Alternating Optimization for Nonnegative Tensor Factorization: Algorithm and Parallel Implementation

It turns out that the derived algorithm is a competitive candidate for the solution of very large-scale dense nonnegative tensor factorization problems.