Extreme Model Compression for On-device Natural Language Understanding

Kanthashree Mysore Sathyendra, Samridhi Choudhary, Leah Nicolich-Henkin
In this paper, we propose and experiment with techniques for extreme compression of neural natural language understanding (NLU) models, making them suitable for execution on resource-constrained devices. We propose a task-aware, end-to-end compression approach that performs word-embedding compression jointly with NLU task learning. We show our results on a large-scale, commercial NLU system trained on a varied set of intents with huge vocabulary sizes. Our approach outperforms a range of… 


Factorization-Aware Training of Transformers for Natural Language Understanding on the Edge
This work proposes factorization-aware training (FAT), in which the linear mappings of an already compressed transformer model (DistilBERT) are factorized and trained jointly on NLU tasks. It also introduces a new metric, the factorization gap, which is used to analyze the need for FAT across various model components.
TINYS2I: A Small-Footprint Utterance Classification Model with Contextual Support for On-Device SLU
TinyS2I reduces latency without degrading accuracy by exploiting use cases in which the distribution of utterances that users speak to a device is largely heavy-tailed.
Learning a Neural Diff for Speech Models
This work presents neural update approaches for release of subsequent speech model generations abiding by a data budget, and details two architecture-agnostic methods which learn compact representations for transmission to devices.
SmallER: Scaling Neural Entity Resolution for Edge Devices
This paper introduces SmallER, a scalable neural entity resolution system capable of running directly on edge devices. It uses compressed tries to reduce the space required to store catalogs, and a novel implementation of spatial partitioning trees to balance reducing runtime latency against preserving recall relative to full catalog search.
Caching Networks: Capitalizing on Common Speech for ASR
This work introduces Caching Networks (CachingNets), a speech recognition network architecture capable of delivering faster, more accurate decoding by leveraging common speech patterns. It also proposes and experiments with different phrase caching policies, which are effective for virtual voice-assistant (VA) applications.


Online Embedding Compression for Text Classification using Low Rank Matrix Factorization
This work presents a compression method that leverages low-rank matrix factorization during training to compress the word embedding layer, which represents the size bottleneck for most NLP models. It also introduces a novel learning rate schedule, the Cyclically Annealed Learning Rate, which is empirically shown to outperform other popular adaptive learning rate algorithms on a sentence classification benchmark.
Near-lossless Binarization of Word Embeddings
Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of ~2% in accuracy while vector size is reduced, and a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.
ONENET: Joint domain, intent, slot prediction for spoken language understanding
This work presents a unified neural network that jointly performs domain, intent, and slot predictions in spoken language understanding systems and adopts a principled architecture for multitask learning to fold in the state-of-the-art models for each task.
FastText.zip: Compressing text classification models
This work proposes a method built upon product quantization to store the word embeddings. The resulting text classifier, derived from the fastText approach, requires only a fraction of the original model's memory at test time without noticeably sacrificing classification accuracy.
Recurrent neural networks for language understanding
This paper modifies the recurrent neural network architecture to perform language understanding and advances the state of the art on the widely used ATIS dataset.
Personalized speech recognition on mobile devices
We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5.
A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding
A joint model is proposed based on the idea that the intent and semantic slots of a sentence are correlative, and it outperforms the state-of-the-art approaches on both tasks.
Simple and Effective Dimensionality Reduction for Word Embeddings
This work presents a novel algorithm that effectively combines PCA based dimensionality reduction with a recently proposed post-processing algorithm, to construct word embeddings of lower dimensions.
Recurrent neural network and LSTM models for lexical utterance classification
This work proposes RNN and LSTM models for utterance intent classification and finds that RNNs work best when utterances are short, while LSTMs work best when utterances are longer.
Joint semantic utterance classification and slot filling with recursive neural networks
Recursive neural networks can be used to perform the core spoken language understanding (SLU) tasks in a spoken dialog system, more specifically domain and intent determination, concurrently with slot filling, in one jointly trained model.