Corpus ID: 59413833

Tensorized Embedding Layers for Efficient Model Compression

@article{Khrulkov2019TensorizedEL,
  title={Tensorized Embedding Layers for Efficient Model Compression},
  author={Valentin Khrulkov and Oleksii Hrinchuk and Leyla Mirvakhabova and I. Oseledets},
  journal={ArXiv},
  year={2019},
  volume={abs/1901.10787}
}
The embedding layers transforming input words into real vectors are the key components of deep neural networks used in natural language processing. However, when the vocabulary is large (e.g., 800k unique words in the One-Billion-Word dataset), the corresponding weight matrices can be enormous, which precludes their deployment in a limited resource setting. We introduce a novel way of parametrizing embedding layers based on the Tensor Train (TT) decomposition, which allows compressing the model… 
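
As a rough illustration of the idea sketched in the abstract (not the authors' implementation; shapes, ranks, and function names below are hypothetical), the NumPy sketch reconstructs a single row of a TT-matrix-factorized embedding: the vocabulary and embedding dimensions are factored into small modes, only the small TT cores are stored, and the huge vocab-by-dim matrix is never materialized.

import numpy as np

def tt_embedding_row(cores, token_id, vocab_shape):
    """Reconstruct one row of the implicit (vocab x dim) embedding matrix.

    cores[k] has shape (r_{k-1}, vocab_shape[k], emb_shape[k], r_k), with
    r_0 = r_d = 1 (TT-matrix format); only the cores are ever stored.
    """
    # Mixed-radix decomposition of the flat token id into a multi-index
    # (i_1, ..., i_d) over the vocabulary modes.
    multi_index, rest = [], token_id
    for n in reversed(vocab_shape):
        multi_index.append(rest % n)
        rest //= n
    multi_index.reverse()

    # Chain the core slices G_k[:, i_k, :, :]; the running factor keeps
    # the shape (embedding_modes_so_far, r_k).
    row = np.ones((1, 1))
    for core, i_k in zip(cores, multi_index):
        core_slice = core[:, i_k, :, :]                 # (r_{k-1}, m_k, r_k)
        row = np.einsum('ar,rmb->amb', row, core_slice)
        row = row.reshape(-1, core_slice.shape[-1])
    return row.reshape(-1)                              # length = prod(m_k)

# Hypothetical factorization: vocab 125,000 = 50*50*50, dim 512 = 8*8*8,
# TT ranks (1, 16, 16, 1).  A dense table would hold 64,000,000 parameters;
# these three cores hold 115,200, roughly a 556x reduction.
vocab_shape, emb_shape, ranks = (50, 50, 50), (8, 8, 8), (1, 16, 16, 1)
rng = np.random.default_rng(0)
cores = [0.02 * rng.standard_normal((ranks[k], vocab_shape[k], emb_shape[k], ranks[k + 1]))
         for k in range(len(vocab_shape))]

embedding = tt_embedding_row(cores, token_id=123_456, vocab_shape=vocab_shape)
print(embedding.shape)                                  # (512,)
print(sum(c.size for c in cores))                       # 115200

In a trainable layer the cores would be learned end to end, and batches of rows can be computed the same way without ever forming the dense embedding matrix.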

Citations

Improving Word Embedding Factorization for Compression using Distilled Nonlinear Neural Decomposition
TLDR
The proposed Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition and knowledge distillation, achieves a higher BLEU score on translation and lower perplexity on language modeling than complex, difficult-to-tune state-of-the-art methods.
Distilled embedding: non-linear embedding factorization using knowledge distillation
TLDR
This paper proposes Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity, and shows that the proposed technique outperforms conventional low-rank matrix factorization and other recently proposed word-embedding matrix compression methods.
Exploring Extreme Parameter Compression for Pre-trained Language Models
TLDR
This work explores larger compression ratios for PLMs, with tensor decomposition as a potential but under-investigated approach, and shows that the proposed method is orthogonal to existing compression methods such as knowledge distillation.
Training with Multi-Layer Embeddings for Model Reduction
TLDR
A multi-layer embedding training (MLET) architecture is introduced that trains embeddings via a sequence of linear layers to achieve a superior embedding accuracy vs. model size trade-off.
Nimble GNN Embedding with Tensor-Train Decomposition
TLDR
A new method for representing embedding tables of graph neural networks (GNNs) more compactly via tensor-train (TT) decomposition is described, which can reduce the size of node embedding vectors by 1,659× to 81,362× on large publicly available benchmark datasets.
Improving Neural Machine Translation with Compact Word Embedding Tables
TLDR
It is demonstrated that, in exchange for negligible deterioration in performance, any NMT model can be run with partially random embeddings, removing the need to store large embedding tables and thus minimizing memory requirements, a significant gain in industrial and on-device settings.
TT-Rec: Tensor Train Compression for Deep Learning Recommendation Model Embeddings
TLDR
The promising potential of Tensor Train decomposition for deep learning recommendation models (TT-Rec) is demonstrated, the effect of the weight initialization distribution on DLRM accuracy is studied, and it is proposed to initialize the tensor cores of TT-Rec from a sampled Gaussian distribution.
A Tensorized Transformer for Language Modeling
TLDR
A novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD), built on tensor train decomposition, is proposed; it can not only largely compress the model parameters but also obtain performance improvements.
Compressing Speech Recognition Networks with MLP via Tensor-Train Decomposition
Dan He, Yu-bin Zhong · 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
TLDR
This paper investigates a compression approach for DNNs based on Tensor-Train (TT) decomposition, applies it to the ASR task, and reveals that the compressed networks can maintain the performance of the original fully-connected network while greatly reducing the number of parameters.
KroneckerBERT: Significant Compression of Pre-trained Language Models Through Kronecker Decomposition and Knowledge Distillation
TLDR
KroneckerBERT is a compressed version of the BERT_BASE model obtained by compressing the embedding layer, the linear mappings in the multi-head attention, and the feed-forward network modules in the Transformer layers; it is trained via a very efficient two-stage knowledge distillation scheme using far fewer data samples.
...

References

Showing 1-10 of 68 references
Wide Compression: Tensor Ring Nets
TLDR
This work introduces Tensor Ring Networks (TR-Nets), which significantly compress both the fully connected layers and the convolutional layers of deep neural networks and show promise in scientific computing and deep learning, especially for emerging resource-constrained devices such as smartphones, wearables, and IoT devices.
GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking
TLDR
GroupReduce, a novel compression method for neural language models based on vocabulary-partitioned low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words), is proposed.
A Tensorized Transformer for Language Modeling
TLDR
A novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD), built on tensor train decomposition, is proposed; it can not only largely compress the model parameters but also obtain performance improvements.
Tensor-Train Recurrent Neural Networks for Video Classification
TLDR
A new, more general, and efficient approach is presented that factorizes the input-to-hidden weight matrix using Tensor-Train decomposition, trained simultaneously with the weights themselves, providing a novel and fundamental building block for modeling high-dimensional sequential data with RNN architectures.
Tensorizing Neural Networks
TLDR
This paper converts the dense weight matrices of the fully-connected layers to the Tensor Train format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved.
Compressing recurrent neural network with tensor train
TLDR
This paper proposes an alternative RNN model that significantly reduces the number of parameters by representing the weight parameters in the Tensor Train (TT) format, and implements the TT-format representation for several RNN architectures such as the simple RNN and the Gated Recurrent Unit (GRU).
Using the Output Embedding to Improve Language Models
TLDR
The topmost weight matrix of neural network language models is studied; it is shown that this matrix constitutes a valid word embedding, and a new method of regularizing the output embedding is offered.
Ultimate tensorization: compressing convolutional and FC layers alike
TLDR
This paper combines the proposed approach with previous work to compress both the convolutional and fully-connected layers of a network, achieving an 80x network compression rate with a 1.1% accuracy drop on the CIFAR-10 dataset.
Adaptive Input Representations for Neural Language Modeling
TLDR
Adaptive input representations for neural language modeling, which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity, are introduced, and a systematic comparison of popular choices for a self-attentional architecture is performed.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, is introduced, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
...