Extreme Model Compression for On-device Natural Language Understanding

Kanthashree Mysore Sathyendra, Samridhi Choudhary, Leah Nicolich-Henkin
In this paper, we propose and experiment with techniques for extreme compression of neural natural language understanding (NLU) models, making them suitable for execution on resource-constrained devices. We propose a task-aware, end-to-end compression approach that performs word-embedding compression jointly with NLU task learning. We show our results on a large-scale, commercial NLU system trained on a varied set of intents with huge vocabulary sizes. Our approach outperforms a range of… 


Learning a Neural Diff for Speech Models

This work presents neural update approaches for releasing subsequent speech model generations under a data budget, and details two architecture-agnostic methods that learn compact representations for transmission to devices.

SmallER: Scaling Neural Entity Resolution for Edge Devices

This paper introduces SmallER, a scalable neural entity resolution system capable of running directly on edge devices. It uses compressed tries to reduce the space required to store catalogs, and a novel implementation of spatial partitioning trees to balance runtime latency against recall relative to full catalog search.

Caching Networks: Capitalizing on Common Speech for ASR

This work introduces Caching Networks (CachingNets), a speech recognition network architecture that delivers faster, more accurate decoding by leveraging common speech patterns. It proposes and experiments with different phrase-caching policies, which prove effective for virtual voice-assistant (VA) applications.

TINYS2I: A Small-Footprint Utterance Classification Model with Contextual Support for On-Device SLU

TinyS2I reduces latency without degrading accuracy by exploiting use cases in which the distribution of utterances users speak to a device is heavy-tailed.

Factorization-Aware Training of Transformers for Natural Language Understanding on the Edge

This work proposes factorization-aware training (FAT), in which the linear mappings of an already compressed transformer model (DistilBERT) are factorized and trained jointly on NLU tasks. It also introduces a new metric, the factorization gap, used to analyze the need for FAT across various model components.



Compressing Word Embeddings via Deep Compositional Code Learning

This work proposes to learn the discrete codes directly in an end-to-end neural network by applying the Gumbel-softmax trick, achieving a 98% compression rate in a sentiment analysis task and 94%–99% in machine translation tasks without performance loss.
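The compositional-code idea can be illustrated without the training loop: each word is represented by M discrete codes, one per codebook, and its vector is reconstructed as a sum of codebook entries. A minimal sketch, with random (not learned) codes purely to show the storage arithmetic and reconstruction; all names and sizes here are illustrative:

```python
import numpy as np

# Illustrative sketch of the compositional-code representation. The paper
# learns the codes end-to-end with Gumbel-softmax; here the codes are
# random, just to show the storage arithmetic and the reconstruction step.
rng = np.random.default_rng(4)
V, d, M, K = 50_000, 300, 8, 16      # vocab, dim, num codebooks, codes each

codebooks = rng.standard_normal((M, K, d)).astype(np.float32)
codes = rng.integers(0, K, size=(V, M))   # stand-in for learned discrete codes

def embed(word_id):
    # A word's vector is the sum of one basis vector from each codebook.
    return codebooks[np.arange(M), codes[word_id]].sum(axis=0)

# Storage: M * log2(K) = 32 bits per word, versus d float32s (9600 bits).
bits_per_word = M * int(np.log2(K))
print(f"{bits_per_word} bits/word vs {d * 32} bits/word uncompressed")

v = embed(0)
print(v.shape)  # (300,)
```

The codebooks themselves add only M*K*d parameters, which is negligible next to the V*d table they replace.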

Online Embedding Compression for Text Classification using Low Rank Matrix Factorization

A compression method that leverages low-rank matrix factorization during training to compress the word embedding layer, which is the size bottleneck for most NLP models. It also introduces a novel learning-rate schedule, the Cyclically Annealed Learning Rate, which is empirically shown to outperform other popular adaptive learning-rate algorithms on a sentence classification benchmark.
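The core of low-rank embedding compression is replacing the V×d table with two factors of rank r. A minimal post-hoc sketch via truncated SVD (the paper factorizes during training; sizes below are toy values):

```python
import numpy as np

# Hypothetical toy setup: a 10k-word vocabulary with 300-dim embeddings.
rng = np.random.default_rng(0)
V, d, r = 10_000, 300, 32          # vocab size, embedding dim, rank
E = rng.standard_normal((V, d))

# Truncated SVD gives the best rank-r approximation E ~ A @ B,
# where A is (V, r) and B is (r, d).
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :r] * S[:r]               # (V, r) compressed embedding table
B = Vt[:r]                         # (r, d) shared projection

# Parameter count drops from V*d to V*r + r*d.
orig_params = V * d
comp_params = V * r + r * d
print(f"compression ratio: {orig_params / comp_params:.1f}x")

# A word's embedding is recovered on the fly as A[word_id] @ B.
approx = A @ B
err = np.linalg.norm(E - approx) / np.linalg.norm(E)
print(f"relative reconstruction error: {err:.3f}")
```

Factorizing during training, as the paper does, lets the task loss shape the factors rather than minimizing reconstruction error alone.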

Near-lossless Binarization of Word Embeddings

Experimental results on semantic similarity, text classification and sentiment analysis tasks show that binarizing word embeddings leads to an accuracy loss of only ~2% while vector size is reduced, and a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.
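The storage arithmetic behind binarization is easy to see with a sign threshold: one bit per dimension instead of a 32-bit float. A hedged sketch (the paper learns the binarization with an autoencoder; the threshold and sizes here are illustrative):

```python
import numpy as np

# Toy stand-in embeddings; a simple sign threshold illustrates the
# storage savings and the cheap Hamming-distance similarity.
rng = np.random.default_rng(1)
V, d = 5_000, 256
E = rng.standard_normal((V, d))

# One bit per dimension: 1 if positive, packed 8 dims per byte.
bits = (E > 0)
packed = np.packbits(bits, axis=1)   # (V, d // 8) uint8 array

float_bytes = E.astype(np.float32).nbytes
binary_bytes = packed.nbytes
print(f"size reduction: {float_bytes // binary_bytes}x")  # 32x

# Hamming distance on packed codes serves as a cheap similarity proxy
# for top-k retrieval (XOR + popcount instead of float dot products).
def hamming(a, b):
    return np.unpackbits(a ^ b).sum()

print(hamming(packed[0], packed[0]))  # a vector has distance 0 to itself
```

The 30x speedup the paper reports comes from exactly this substitution: bitwise XOR/popcount in place of floating-point dot products.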

ONENET: Joint domain, intent, slot prediction for spoken language understanding

This work presents a unified neural network that jointly performs domain, intent, and slot predictions in spoken language understanding systems and adopts a principled architecture for multitask learning to fold in the state-of-the-art models for each task.

FastText.zip: Compressing text classification models

This work proposes a method built on product quantization to store word embeddings. The resulting text classifier, derived from the fastText approach, requires only a fraction of the original's memory at test time without noticeably sacrificing classification accuracy.
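Product quantization splits each vector into m subvectors and quantizes each subspace with its own small codebook, so a word is stored as m byte-sized codes. A minimal sketch with a tiny hand-rolled k-means (a real system would use a tuned library; sizes are toy values):

```python
import numpy as np

# Minimal product-quantization sketch: split each vector into m subvectors
# and k-means-quantize each subspace independently.
rng = np.random.default_rng(2)
V, d, m, k = 2_000, 64, 8, 16        # vocab, dim, subspaces, centroids each
E = rng.standard_normal((V, d)).astype(np.float32)
sub = d // m

def kmeans(X, k, iters=10):
    # Tiny k-means for illustration only.
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C, assign

codebooks, codes = [], np.empty((V, m), dtype=np.uint8)
for s in range(m):
    C, assign = kmeans(E[:, s * sub:(s + 1) * sub], k)
    codebooks.append(C)
    codes[:, s] = assign

# Storage: one byte per subspace per word plus the small codebooks,
# versus 4 bytes per float32 dimension originally.
orig = E.nbytes
comp = codes.nbytes + sum(C.nbytes for C in codebooks)
print(f"compression: {orig / comp:.1f}x")
```

A word's approximate vector is recovered by concatenating the m selected centroids, so lookup stays O(m) per word.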

Recurrent neural networks for language understanding

This paper modifies the architecture to perform language understanding and advances the state of the art on the widely used ATIS dataset.

Personalized speech recognition on mobile devices

We describe a large-vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real time on a Nexus 5.

A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding

A joint model is proposed based on the idea that the intent and semantic slots of a sentence are correlated; it outperforms state-of-the-art approaches on both tasks.

Simple and Effective Dimensionality Reduction for Word Embeddings

This work presents a novel algorithm that effectively combines PCA-based dimensionality reduction with a recently proposed post-processing algorithm to construct word embeddings of lower dimension.
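The combination can be sketched as: post-process (center and project out the top principal directions), PCA-reduce, then post-process again. A hedged sketch under assumed hyperparameters (the counts below are illustrative, not the authors' settings):

```python
import numpy as np

# Sketch of PCA reduction combined with a post-processing step that
# removes the top principal components; hyperparameters are illustrative.
rng = np.random.default_rng(3)
V, d, new_d, top = 3_000, 300, 150, 7
E = rng.standard_normal((V, d))

def postprocess(X, n_remove):
    # Subtract the mean, then project out the n_remove top PCA directions,
    # which tend to dominate the embedding space disproportionately.
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X - (X @ Vt[:n_remove].T) @ Vt[:n_remove]

X = postprocess(E, top)
X = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
X = X @ Vt[:new_d].T                 # PCA reduction to new_d dimensions
X = postprocess(X, top)              # post-process again after reduction

print(X.shape)  # (3000, 150)
```

Halving the dimensionality halves embedding storage with no lookup-time cost, which is why this pairs naturally with the table-compression methods above.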

Recurrent neural network and LSTM models for lexical utterance classification

This work proposes RNN and LSTM models for utterance intent classification and finds that RNNs work best when utterances are short, while LSTMs work best when utterances are longer.