Compression of End-to-End Models

@inproceedings{Pang2018CompressionOE,
  title={Compression of End-to-End Models},
  author={Ruoming Pang and Tara N. Sainath and Rohit Prabhavalkar and Suyog Gupta and Yonghui Wu and Shuyuan Zhang and Chung-Cheng Chiu},
  booktitle={INTERSPEECH},
  year={2018}
}
End-to-end models, which directly output text given speech using a single neural network, have been shown to be competitive with conventional speech recognition models containing separate acoustic, pronunciation, and language model components. Such models do not require additional resources for decoding and are typically much smaller than conventional models. This makes them particularly attractive in the context of on-device speech recognition, where both small memory footprint and low power…

Citations

Recent Advances in End-to-End Automatic Speech Recognition

  • Jinyu Li
  • Computer Science
    APSIPA Transactions on Signal and Information Processing
  • 2022
TLDR
This paper overviews the recent advances in E2E models, focusing on technologies that address the field's remaining challenges from the industry's perspective.

A Comparison of End-to-End Models for Long-Form Speech Recognition

  • C. Chiu, Wei Han, Yonghui Wu
  • Computer Science
    2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2019
TLDR
This paper investigates and improves the performance of end-to-end models on long-form transcription and explores two improvements to attention-based systems that significantly improve their performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments.
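
The overlapping-segment idea is easy to make concrete. The sketch below is a minimal illustration, not that paper's decoder: split a long waveform into fixed-length windows that overlap, decode each window independently, and merge the hypotheses afterwards. `overlapping_segments` is a hypothetical helper, and the segment/overlap lengths are arbitrary choices.

```python
import numpy as np

def overlapping_segments(samples, seg_len, overlap):
    """Yield fixed-length, overlapping windows over a 1-D waveform.

    seg_len and overlap are in samples; consecutive windows share
    `overlap` samples. Hypothetical helper, for illustration only.
    """
    hop = seg_len - overlap
    for start in range(0, max(len(samples) - overlap, 1), hop):
        yield samples[start:start + seg_len]

# Example: 60 s of 16 kHz audio, 8 s windows with 2 s overlap.
audio = np.zeros(60 * 16000, dtype=np.float32)
segments = list(overlapping_segments(audio, seg_len=8 * 16000, overlap=2 * 16000))
print(len(segments))  # 10 windows; each is decoded independently, then merged
```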

Iterative Compression of End-to-End ASR Model using AutoML

TLDR
This work proposes an iterative AutoML-based low-rank factorization (LRF) approach that achieves over 5x compression without degrading the WER, thereby advancing the state of the art in ASR compression.

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

TLDR
This work analyzes the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models in order to identify model components that negatively affect generalization performance and proposes two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.

ShrinkML: End-to-End ASR Model Compression Using Reinforcement Learning

TLDR
An AutoML system that uses reinforcement learning (RL) to optimize the per-layer compression ratios when applied to a state-of-the-art attention-based end-to-end ASR model composed of several LSTM layers, using singular value decomposition (SVD) low-rank matrix factorization as the compression method.
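
The SVD step itself is a few lines; here is a minimal sketch, assuming a plain weight matrix and no fine-tuning afterwards. Replacing an m×n matrix W by two rank-k factors cuts the parameter count from m·n to k·(m+n); the per-layer k is exactly what ShrinkML's RL controller searches over.

```python
import numpy as np

def svd_compress(W, k):
    """Factor W (m x n) into A (m x k) @ B (k x n) via truncated SVD.

    Parameter count drops from m*n to k*(m + n); k is the per-layer
    rank an AutoML/RL controller would choose.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # fold singular values into the left factor
    B = Vt[:k, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512))
A, B = svd_compress(W, k=64)
print(A.shape, B.shape)  # (1024, 64) (64, 512): ~5.3x fewer parameters
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # relative approximation error
```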

Knowledge Distillation Using Output Errors for Self-attention End-to-end Models

  • Ho-Gyeong Kim, Hwidong Na, Y. S. Choi
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
To overcome the performance degradation of compressed models, the proposed method adds an exponential weight to the sequence-level knowledge distillation loss, reflecting the word error rate of the teacher model's output against the ground-truth word sequences.
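
One plausible form of that weighting (a sketch under assumptions, not necessarily the paper's exact formulation): scale the sequence-level distillation term by exp(-α·WER) of the teacher's hypothesis against the reference, so transcripts the teacher got wrong contribute less. `weighted_seq_kd_loss` and α are illustrative names.

```python
import numpy as np

def word_error_rate(hyp, ref):
    """Standard WER via word-level Levenshtein distance."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(len(ref), 1)

def weighted_seq_kd_loss(seq_kd_loss, teacher_hyp, ref, alpha=1.0):
    # Assumed form of the exponential weight: down-weight distillation
    # from hypotheses the teacher transcribed incorrectly.
    return np.exp(-alpha * word_error_rate(teacher_hyp, ref)) * seq_kd_loss

print(weighted_seq_kd_loss(2.0, "the cat sat".split(), "the cat sat".split()))  # 2.0
print(weighted_seq_kd_loss(2.0, "the bat sat".split(), "the cat sat".split()))  # ~1.43
```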

Learning a Neural Diff for Speech Models

TLDR
This work presents neural update approaches for releasing subsequent speech model generations under a data budget, and details two architecture-agnostic methods that learn compact representations for transmission to devices.

Extremely Low Footprint End-to-End ASR System for Smart Device

TLDR
This work proposes an extremely low-footprint E2E ASR system for smart devices that satisfies resource constraints without sacrificing recognition accuracy; it designs cross-layer weight sharing to improve parameter efficiency and exploits model compression methods, including sparsification and quantization, to reduce memory storage and boost decoding efficiency.
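
Of the methods that summary names, post-training weight quantization is the simplest to sketch. Below is a minimal symmetric per-tensor 8-bit scheme, an illustration under assumptions rather than the paper's recipe; real systems typically quantize per-channel and may fine-tune afterwards.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor 8-bit quantization: W ~= scale * q."""
    scale = np.abs(W).max() / 127.0  # assumes W is not all zeros
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)
q, scale = quantize_int8(W)
print(q.nbytes / W.nbytes)                     # 0.25: 4x smaller storage
print(np.abs(W - dequantize(q, scale)).max())  # worst-case rounding error
```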

Improving Streaming Automatic Speech Recognition with Non-Streaming Model Distillation on Unsupervised Data

TLDR
This work proposes a novel and effective learning method by leveraging a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models.
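
The data flow reduces to a short loop. The sketch below uses stand-in callables (`transcribe`, `train_step`) in place of real teacher/student model APIs, which are assumptions here: the non-streaming teacher labels unlabeled audio once, and the streaming student trains on the resulting pairs.

```python
def distill_on_unsupervised(transcribe, train_step, unlabeled_audio, epochs=1):
    """Pseudo-label distillation loop (sketch; `transcribe` and `train_step`
    are hypothetical callables standing in for real teacher/student models)."""
    # 1) Non-streaming teacher labels the raw audio once.
    pseudo_pairs = [(x, transcribe(x)) for x in unlabeled_audio]
    # 2) Streaming student trains on (audio, pseudo-transcript) pairs.
    for _ in range(epochs):
        for audio, text in pseudo_pairs:
            train_step(audio, text)

# Toy usage with stand-ins:
distill_on_unsupervised(
    transcribe=lambda x: "hello world",  # teacher inference
    train_step=lambda a, t: None,        # student update
    unlabeled_audio=[b"...", b"..."],
)
```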

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

TLDR
An enhanced end-to-end monaural multi-talker ASR architecture and training strategy to recognize overlapped speech, demonstrating that the proposed architectures can significantly improve multi-talker mixed speech recognition.

References

Showing 1-10 of 40 references

State-of-the-Art Speech Recognition with Sequence-to-Sequence Models

TLDR
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly-used single-head attention.
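
Multi-head attention is easy to state concretely; below is a compact numpy version of generic scaled dot-product heads (learned projection matrices omitted for brevity; this is a sketch of the mechanism, not the exact LAS variant).

```python
import numpy as np

def multi_head_attention(Q, K, V, num_heads):
    """Scaled dot-product attention with num_heads parallel heads.

    Q: (tq, d); K, V: (tk, d); d must be divisible by num_heads.
    Learned input/output projections are omitted here.
    """
    tq, d = Q.shape
    dh = d // num_heads
    out = np.empty((tq, d))
    for h in range(num_heads):
        q, k, v = (M[:, h * dh:(h + 1) * dh] for M in (Q, K, V))
        scores = q @ k.T / np.sqrt(dh)                       # (tq, tk)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                   # row-wise softmax
        out[:, h * dh:(h + 1) * dh] = w @ v
    return out

rng = np.random.default_rng(0)
enc = rng.standard_normal((50, 256))  # encoder states ("listen")
dec = rng.standard_normal((1, 256))   # one decoder query ("attend")
print(multi_head_attention(dec, enc, enc, num_heads=4).shape)  # (1, 256)
```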

Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer

TLDR
This work investigates training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T) and finds that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors.

Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

TLDR
It is shown that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode.

An Investigation of a Knowledge Distillation Method for CTC Acoustic Models

TLDR
To improve the performance of unidirectional RNN-based CTC, which is suitable for real-time processing, the knowledge distillation (KD)-based model compression method for training a CTC acoustic model is investigated and a frame-level and a sequence-level KD method are evaluated.
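
The frame-level variant has a particularly simple form: a per-frame cross-entropy between teacher posteriors and student log-posteriors, summed over the CTC output vocabulary including the blank. A minimal sketch, assuming both models emit aligned per-frame distributions:

```python
import numpy as np

def frame_level_kd_loss(student_logp, teacher_p):
    """Frame-averaged cross-entropy between teacher probabilities (T, V)
    and student log-probabilities (T, V); equals KL(teacher || student)
    up to the constant teacher-entropy term."""
    return float(-(teacher_p * student_logp).sum(axis=-1).mean())

rng = np.random.default_rng(0)
T, V = 100, 30  # frames x output vocabulary (incl. the CTC blank)
teacher_p = rng.dirichlet(np.ones(V), size=T)
logits = rng.standard_normal((T, V))
student_logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
print(frame_level_kd_loss(student_logp, teacher_p))
```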

On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition

TLDR
This work presents a technique for general recurrent model compression that jointly compresses both recurrent and non-recurrent inter-layer weight matrices, and finds that the proposed technique reduces the size of a Long Short-Term Memory (LSTM) acoustic model to a third of its original size with negligible loss in accuracy.

End-to-end attention-based large vocabulary speech recognition

TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.

Lower Frame Rate Neural Network Acoustic Models

TLDR
On a large vocabulary Voice Search task, it is shown that with conventional models, one can slow the frame rate to 40ms while improving WER by 3% relative over a CTC-based model, thus improving overall system speed.

Knowledge distillation for small-footprint highway networks

TLDR
This paper significantly improved the recognition accuracy of the HDNN acoustic model with less than 0.8 million parameters, and narrowed the gap between this model and the plain DNN with 30 million parameters.

Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition

TLDR
This paper describes how to use knowledge distillation to combine acoustic models in a way that improves recognition accuracy significantly, can be implemented with standard training tools, and requires no additional complexity during recognition.
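
The combination step that keeps recognition cheap can be sketched directly: interpolate the ensemble members' per-frame posteriors into a single soft target, then distill one student toward it. A minimal sketch; uniform weights are an assumption.

```python
import numpy as np

def ensemble_targets(member_posteriors, weights=None):
    """Weighted average of per-frame posteriors from several acoustic models;
    the result is the soft target a single student model is trained on."""
    P = np.stack(member_posteriors)  # (M, T, V)
    w = np.full(len(P), 1.0 / len(P)) if weights is None else np.asarray(weights)
    return np.tensordot(w, P, axes=1)  # (T, V)

rng = np.random.default_rng(0)
members = [rng.dirichlet(np.ones(30), size=100) for _ in range(4)]
targets = ensemble_targets(members)
print(targets.shape, np.allclose(targets.sum(axis=-1), 1.0))  # (100, 30) True
```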

Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets

TLDR
A low-rank matrix factorization of the final weight layer is proposed and applied to DNNs for both acoustic modeling and language modeling, showing an equivalent reduction in training time with no significant loss in final recognition accuracy compared to a full-rank representation.