Sequence-level Knowledge Distillation for Model Compression of Attention-based Sequence-to-sequence Speech Recognition

Raden Mu'az Mun'im, Nakamasa Inoue, Koichi Shinoda
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We investigate the feasibility of sequence-level knowledge distillation of Sequence-to-Sequence (Seq2Seq) models for Large Vocabulary Continuous Speech Recognition (LVCSR). We first use a pre-trained larger teacher model to generate multiple hypotheses per utterance with beam search. With the same input, we then train the student model using these hypotheses generated from the teacher as pseudo labels in place of the original ground truth labels. We evaluate our proposed method using Wall… 
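
The two steps described in the abstract (the teacher produces several beam-search hypotheses per utterance, then the student is trained on them as pseudo labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_teacher`, `beam_search`, and `make_pseudo_labeled_set` are hypothetical names, the teacher is a hard-coded probability table standing in for a real attention-based Seq2Seq model, and the excerpt does not say how the hypotheses are weighted or how the student is optimized.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # end-of-sequence token (assumed convention)
Prefix = Tuple[str, ...]


def beam_search(next_log_probs: Callable[[Prefix], Dict[str, float]],
                beam_size: int, max_len: int) -> List[Tuple[Prefix, float]]:
    """Return up to `beam_size` finished hypotheses with their log-probabilities."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_len):
        # Expand every live prefix by every possible next token.
        candidates = [(prefix + (tok,), score + lp)
                      for prefix, score in beams
                      for tok, lp in next_log_probs(prefix).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses ending in EOS are complete; the rest stay in the beam.
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    # Prefixes still unfinished after max_len steps are dropped.
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


def toy_teacher(utterance: str, prefix: Prefix) -> Dict[str, float]:
    """Stand-in teacher: a fixed next-token table that ignores the audio."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"c": 0.7, EOS: 0.3},
        ("b",): {EOS: 1.0},
        ("a", "c"): {EOS: 1.0},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}


def make_pseudo_labeled_set(utterances: List[str], teacher, beam_size: int,
                            max_len: int = 10) -> List[Tuple[str, List[str]]]:
    """Pair each utterance with each teacher hypothesis (the pseudo labels)."""
    pairs = []
    for utt in utterances:
        hyps = beam_search(lambda prefix: teacher(utt, prefix), beam_size, max_len)
        pairs.extend((utt, list(hyp)) for hyp, _score in hyps)
    return pairs


# Each (utterance, hypothesis) pair would then be fed to the student with an
# ordinary sequence cross-entropy loss, in place of the ground-truth transcript.
pseudo_set = make_pseudo_labeled_set(["utt1"], toy_teacher, beam_size=2)
```

With `beam_size=2`, every utterance contributes up to two training pairs, so the student sees the teacher's runner-up hypotheses as well as its single best guess.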


Sequence-Level Self-Learning with Multiple Hypotheses
New self-learning techniques are proposed for an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR), using the multi-task learning (MTL) framework, where the n-th best ASR hypothesis is used as the label of each task.
Hierarchical Transformer-Based Large-Context End-To-End ASR with Large-Context Knowledge Distillation
A hierarchical transformer-based large-context E2E-ASR model is proposed that combines the transformer architecture with hierarchical encoder-decoder based large-context modeling, together with a large-context knowledge distillation method that distills knowledge from a pre-trained large-context language model in the training phase.
Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition
This work extends the T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: teacher's token posteriors as soft labels and one-best predictions as decoder guidance.
Large scale weakly and semi-supervised learning for low-resource video ASR
A large scale systematic comparison between two self-labeling methods, and weakly-supervised pretraining using contextual metadata on the challenging task of transcribing social media videos in low-resource conditions is conducted.
Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution
A novel T/S learning with conditional posterior distribution for encoder-decoder based ASR is proposed, which reduces WER by 19.2% relatively on the LibriSpeech benchmark, compared with a system trained using only paired data.
Sequence-Level Consistency Training for Semi-Supervised End-to-End Automatic Speech Recognition
The experiments show that the semi-supervised learning proposal with sequence-level consistency training can efficiently improve ASR performance using unlabeled speech data.
Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition
A novel knowledge transfer and distillation architecture is proposed that leverages knowledge from AR models to improve the NAR performance while reducing the model’s size.
Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models
We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its
Sharing Attention Weights for Fast Transformer
This paper speeds up the Transformer via a fast and lightweight attention model that shares attention weights in adjacent layers and enables efficient re-use of hidden states in a vertical manner.


Sequence-Level Knowledge Distillation
It is demonstrated that standard knowledge distillation applied to word-level prediction can be effective for NMT, and two novel sequence-level versions of knowledge distillation are introduced that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search.
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
An Investigation of a Knowledge Distillation Method for CTC Acoustic Models
To improve the performance of unidirectional RNN-based CTC, which is suitable for real-time processing, the knowledge distillation (KD)-based model compression method for training a CTC acoustic model is investigated and a frame-level and a sequence-level KD method are evaluated.
End-to-end attention-based large vocabulary speech recognition
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Acoustic modelling with CD-CTC-SMBR LSTM RNNS
This paper describes a series of experiments to extend the application of Context-Dependent long short-term memory recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) and sMBR loss and investigates transferring knowledge from one network to another through alignments.
Attention-based Wav2Text with feature transfer learning
Experimental results reveal that the proposed Attention-based Wav2Text model directly with raw waveform could achieve a better result in comparison with the attentional encoder-decoder model trained on standard front-end filterbank features.
Attention-Based Models for Speech Recognition
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the
Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
Building DNN acoustic models for large vocabulary speech recognition