HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression

@article{Gu2022HEATHA,
  title={HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression},
  author={Jiaqi Gu and Ben Keller and Jean Kossaifi and Anima Anandkumar and Brucek Khailany and David Z. Pan},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.16749},
  url={https://api.semanticscholar.org/CorpusID:254096167}
}
A hardware-aware tensor decomposition framework is proposed that enables efficient exploration of the exponential space of possible decompositions, automates the choice of tensorization shape and decomposition rank through hardware-aware co-optimization, and jointly investigates tensor contraction path optimization and a fused Einsum mapping strategy.
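
As a rough illustration of the contraction-path aspect, the sketch below uses numpy's einsum_path to compare contraction orders for a small tensorized linear layer; all shapes are hypothetical and not taken from the paper.

import numpy as np

# Hypothetical tensorized linear layer: input and output dimensions are split
# into two modes each, and the weight is stored as four small factors.
batch = 64
x = np.random.randn(batch, 16, 32)   # activations with 16*32 = 512 features
a = np.random.randn(16, 8)           # factor tensors of the decomposition
b = np.random.randn(32, 8)
c = np.random.randn(8, 16)
d = np.random.randn(8, 32)

expr = "bij,ir,js,rk,sl->bkl"        # contract input modes, produce output modes

# einsum_path searches over contraction orders and reports the FLOP count of
# each candidate; this cost is the kind of quantity a hardware-aware search
# would trade off against accuracy.
path, info = np.einsum_path(expr, x, a, b, c, d, optimize="optimal")
print(info)

y = np.einsum(expr, x, a, b, c, d, optimize=path)
print(y.shape)                       # (64, 16, 32)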

Learning Low-Rank Tensor Cores with Probabilistic ℓ0-Regularized Rank Selection for Model Compression

A novel automatic rank selection method for deep model compression is proposed that learns model weights and decomposition ranks simultaneously and can be incorporated with arbitrary tensor decompositions and neural network layers such as linear, convolutional, and embedding layers.
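
A minimal sketch of the general idea, assuming one sigmoid gate per rank-one component and an expected-l0 penalty rather than the paper's exact probabilistic formulation; the GatedCPLinear name, sizes, and rank are made up for illustration.

import torch
import torch.nn as nn

class GatedCPLinear(nn.Module):
    """Low-rank linear layer whose rank-one components can be switched off by
    learnable gates. Loose sketch only; the paper's probabilistic l0 relaxation
    differs from this simple sigmoid gating."""
    def __init__(self, in_features, out_features, max_rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(in_features, max_rank) * 0.02)
        self.V = nn.Parameter(torch.randn(max_rank, out_features) * 0.02)
        self.gate_logits = nn.Parameter(torch.zeros(max_rank))

    def forward(self, x):
        g = torch.sigmoid(self.gate_logits)      # soft keep-probability per component
        return (x @ self.U) * g @ self.V         # gated rank-one components

    def expected_l0(self):
        # Expected number of active components: the sparsity penalty on the rank.
        return torch.sigmoid(self.gate_logits).sum()

layer = GatedCPLinear(256, 256, max_rank=64)
x = torch.randn(8, 256)
task_loss = layer(x).pow(2).mean()               # stand-in for a real task loss
loss = task_loss + 1e-3 * layer.expected_l0()    # weights and ranks learned jointly
loss.backward()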

Partial Tensorized Transformers for Natural Language Processing

This work focuses on both embedding-layer compression and partial tensorization of neural networks (PTNN) through an algorithmic approach, and significantly improves the accuracy of existing models by up to 5%, all without the need for post-training adjustments.

Unified Framework for Neural Network Compression via Decomposition and Optimal Rank Selection

This paper presents a unified framework that simultaneously applies decomposition and optimal rank selection, employing a composite compression loss within defined rank constraints, and maintains the performance of highly compressed models on par with their original counterparts.

ESPACE: Dimensionality Reduction of Activations for Model Compression

A comparison with related work on compressing Llama2-7B via matrix factorization shows that ESPACE is a first step toward advancing the state of the art in tensor decomposition compression of LLMs.
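
A loose sketch of activation-space dimensionality reduction under stated assumptions: a PCA-style projection is computed from calibration activations and folded into the weight. The shapes and the eigendecomposition recipe are illustrative, not ESPACE's actual procedure.

import numpy as np

# W x ~= (W P) (P^T x): project activations onto their top principal
# directions and fold the projection into the weight offline.
d_in, d_out, n_calib, k = 1024, 1024, 4096, 256
X = np.random.randn(n_calib, d_in)           # calibration activations (random stand-in)
W = np.random.randn(d_out, d_in) * 0.02

cov = X.T @ X / n_calib
eigvals, eigvecs = np.linalg.eigh(cov)
P = eigvecs[:, -k:]                          # top-k activation directions

W_low = W @ P                                # (d_out, k), computed once offline
x = np.random.randn(d_in)
y_approx = W_low @ (P.T @ x)                 # two skinny matmuls at inference
y_exact = W @ x

# With random calibration data the error is large; real activations are
# strongly correlated, which is what makes this kind of projection useful.
print(np.linalg.norm(y_exact - y_approx) / np.linalg.norm(y_exact))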

CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization

CoMERA achieves rank-adaptive tensor-compressed (pre-)training via a multi-objective optimization formulation and improves the training procedure to provide both a high compression ratio and excellent accuracy during training.

Quantum-Inspired Tensor Network for Earth Science

A quantum-inspired tensor network is employed to compress the trainable parameters of physics-informed neural networks (PINNs) in Earth science, and the spectral resolution of remotely sensed images is improved by employing tensor decomposition.

Smartformer: An intelligent transformer compression framework for time-series modeling

An intelligent model compression framework, Smartformer, is proposed that incorporates reinforcement learning and CP-decomposition techniques to satisfy three design objectives; it can mitigate overfitting and thus improve the accuracy of existing time-series models in all scenarios.

Gradient-Free Structured Pruning with Unlabeled Data

This paper proposes a gradient-free structured pruning framework that uses only unlabeled data and shows that the original FLOP count can be reduced by up to 40% with less than a 4% accuracy loss across all tasks considered.
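
A rough sketch of the flavor of gradient-free structured pruning with unlabeled data, assuming a simple activation-magnitude score over FFN hidden units; the scoring rule and sizes are assumptions, not the paper's exact criterion.

import torch
import torch.nn as nn

# Score the hidden units of an FFN block on unlabeled data, then drop the
# lowest-scoring units by slicing the weight matrices (no gradients needed).
d_model, d_ff, keep = 256, 1024, 768
fc1, fc2 = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)

unlabeled = torch.randn(512, d_model)              # unlabeled calibration batch
with torch.no_grad():
    h = torch.relu(fc1(unlabeled))
    scores = h.abs().mean(dim=0)                   # one score per hidden unit
    idx = scores.topk(keep).indices.sort().values  # units to keep

    pruned_fc1 = nn.Linear(d_model, keep)
    pruned_fc2 = nn.Linear(keep, d_model)
    pruned_fc1.weight.copy_(fc1.weight[idx])
    pruned_fc1.bias.copy_(fc1.bias[idx])
    pruned_fc2.weight.copy_(fc2.weight[:, idx])
    pruned_fc2.bias.copy_(fc2.bias)

print(sum(p.numel() for p in (pruned_fc1.weight, pruned_fc2.weight)))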

Transformers in Speech Processing: A Survey

By consolidating findings from across the speech technology landscape, this paper provides a valuable resource for researchers interested in harnessing the power of transformers to advance the field.

Deeply Tensor Compressed Transformers for End-to-End Object Detection

This paper proposes to deeply compress transformers with low-rank tensor decomposition to obtain a compact end-to-end detection framework, and introduces a gated multi-head attention (GMHA) module to mitigate the accuracy drop of tensor-compressed DETR models.
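
A minimal sketch of what gating attention heads could look like, assuming one learnable sigmoid gate per head applied to that head's output; this is an assumption about the flavor of GMHA, not the paper's exact module.

import torch
import torch.nn as nn

class GatedMHA(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.gates = nn.Parameter(torch.ones(n_heads))    # one learnable gate per head

    def forward(self, x):                                 # x: (batch, seq, d_model)
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        att = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        heads = att @ v                                    # (batch, heads, seq, dk)
        heads = heads * torch.sigmoid(self.gates).view(1, -1, 1, 1)   # gate each head
        return self.proj(heads.transpose(1, 2).reshape(b, s, d))

x = torch.randn(2, 10, 256)
print(GatedMHA()(x).shape)   # torch.Size([2, 10, 256])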

TIE: Energy-efficient Tensor Train-based Inference Engine for Deep Neural Network

A computation-efficient inference scheme for TT-format DNNs is presented that enjoys two key merits: 1) it achieves the theoretical minimum number of multiplications, eliminating all redundant computations; and 2) its multi-stage processing scheme reduces intensive memory access to the tensor cores, bringing significant energy savings.

TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models

The promising potential of Tensor Train decomposition for DLRMs (TT-Rec) is demonstrated, the effect of the weight initialization distribution on DLRM accuracy is studied, and initializing the tensor cores of TT-Rec from a sampled Gaussian distribution is proposed.
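
A small sketch of a TT-factorized embedding table with Gaussian-initialized cores; the factorization of the vocabulary and embedding sizes, the TT ranks, and the standard deviation below are illustrative choices, not TT-Rec's exact recipe.

import numpy as np

vocab_factors = (50, 40, 25)        # 50*40*25 = 50,000 rows
emb_factors = (4, 4, 8)             # 4*4*8   = 128-dimensional embeddings
ranks = (1, 16, 16, 1)

# Each TT core is initialized from a sampled Gaussian (std chosen ad hoc here).
cores = [
    np.random.normal(0.0, 0.1, size=(ranks[k], vocab_factors[k], emb_factors[k], ranks[k + 1]))
    for k in range(3)
]
print("TT parameters:", sum(c.size for c in cores))
print("dense parameters:", 50 * 40 * 25 * 4 * 4 * 8)

def lookup(flat_idx):
    # Map the flat row index to a multi-index, then multiply core slices.
    i0, r = divmod(flat_idx, 40 * 25)
    i1, i2 = divmod(r, 25)
    row = cores[0][:, i0, :, :]                       # (1, e0, r1)
    for core, i in ((cores[1], i1), (cores[2], i2)):
        row = np.einsum("aeb,bfc->aefc", row, core[:, i, :, :]).reshape(1, -1, core.shape[-1])
    return row.reshape(-1)                            # 128-dim embedding vector

print(lookup(12345).shape)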

Tensor Methods in Computer Vision and Deep Learning

This article provides an in-depth and practical review of tensors and tensor methods in the context of representation learning and deep learning, with a particular focus on visual data analysis and computer vision applications.

Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination

This work provides the first general-purpose rank-adaptive framework for end-to-end tensorized training of neural networks and develops a scalable stochastic variational inference solver to estimate the posterior density of large-scale problems in training.

A Tensorized Transformer for Language Modeling

A novel self-attention model (namely Multi-linear attention) based on Block-Term Tensor Decomposition (BTD) is proposed, which can not only largely compress the model parameters but also obtain performance improvements.
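
For reference, a minimal numpy sketch of a Block-Term Decomposition, i.e. a sum of small Tucker terms (core plus factor matrices); dimensions, rank, and block count are illustrative and unrelated to the attention construction in the paper.

import numpy as np

dims, rank, n_blocks = (32, 32, 32), 4, 3

# Each block is a small Tucker term: a core tensor and one factor matrix per mode.
blocks = []
for _ in range(n_blocks):
    core = np.random.randn(rank, rank, rank)
    factors = [np.random.randn(d, rank) for d in dims]
    blocks.append((core, factors))

def reconstruct(blocks):
    T = 0
    for core, (A, B, C) in blocks:
        # Tucker term: core contracted with a factor matrix along each mode.
        T = T + np.einsum("abc,ia,jb,kc->ijk", core, A, B, C)
    return T

T = reconstruct(blocks)
full_params = int(np.prod(dims))
btd_params = n_blocks * (rank ** 3 + sum(d * rank for d in dims))
print(T.shape, full_params, btd_params)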

MiniViT: Compressing Vision Transformers with Weight Multiplexing

MiniViT is a new compression framework that achieves parameter reduction in vision transformers while retaining the same performance; it shares weights across layers while imposing a transformation on the weights to increase diversity.
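
A rough sketch of the weight-multiplexing idea under stated assumptions: one shared weight is reused in every layer, and a cheap per-layer transformation (a diagonal scaling here, which is an assumed form, not MiniViT's exact scheme) restores diversity.

import torch
import torch.nn as nn

class MultiplexedLinear(nn.Module):
    def __init__(self, d_model, n_layers):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(d_model, d_model) * 0.02)  # shared across layers
        self.scales = nn.Parameter(torch.ones(n_layers, d_model))         # cheap per-layer transform

    def forward(self, x, layer_idx):
        W = self.shared * self.scales[layer_idx].unsqueeze(1)   # layer-specific weight
        return x @ W.t()

block = MultiplexedLinear(d_model=192, n_layers=12)
x = torch.randn(4, 192)
print(block(x, layer_idx=3).shape)   # torch.Size([4, 192])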

Tensor Decomposition for Compressing Recurrent Neural Network

This paper utilizes several tensor decomposition methods, including CANDECOMP/PARAFAC, Tucker decomposition, and Tensor Train, to re-parameterize the Gated Recurrent Unit (GRU) RNN, reducing the number of parameters while maintaining the expressive power of the RNN.
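
A small sketch of CP re-parameterization of a single GRU weight, assuming the input-to-hidden matrix is reshaped into a 4-way tensor and stored as a sum of rank-one terms; shapes and rank are illustrative, not the paper's settings.

import numpy as np

hidden, inp, rank = 128, 96, 20
shape = (3 * hidden // 8, 8, inp // 8, 8)        # 4-way reshape of the (384, 96) gate weight

# CP format: one factor matrix per mode, summed over a shared rank index.
factors = [np.random.randn(d, rank) * 0.1 for d in shape]
W_tensor = np.einsum("ar,br,cr,dr->abcd", *factors)
W = W_tensor.reshape(3 * hidden, inp)

x = np.random.randn(inp)
print((W @ x).shape)                             # stands in for the dense gate pre-activation

dense_params = 3 * hidden * inp
cp_params = sum(d * rank for d in shape)
print(dense_params, cp_params)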

Tensorized Embedding Layers

A novel way of parameterizing embedding layers based on the Tensor Train decomposition is introduced, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance.

Tensorizing Neural Networks

This paper converts the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved.
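
A minimal sketch of the TT-format fully-connected layer: the weight is stored as a chain of 4-way cores and applied by tensor contractions instead of a dense matrix multiply. Mode sizes and TT ranks below are illustrative choices, not values from the paper.

import numpy as np

m = n = (8, 8, 8)                    # 512 = 8*8*8 on both the input and output side
ranks = (1, 16, 16, 1)
c1, c2, c3 = (
    np.random.randn(ranks[k], m[k], n[k], ranks[k + 1]) * 0.1 for k in range(3)
)

x = np.random.randn(8 * 8 * 8)

# TT forward pass: contract the cores with the tensorized input directly.
y_tt = np.einsum(
    "pabq,qcdr,refs,bdf->ace", c1, c2, c3, x.reshape(n), optimize=True
).reshape(-1)

# Dense reference, built only to check the sketch.
W = np.einsum("pabq,qcdr,refs->acebdf", c1, c2, c3).reshape(512, 512)
print(np.allclose(W @ x, y_tt))                                # True
print("dense params:", W.size, "TT params:", c1.size + c2.size + c3.size)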