• Corpus ID: 239050134

Knowledge distillation from language model to acoustic model: a hierarchical multi-task learning approach

  title={Knowledge distillation from language model to acoustic model: a hierarchical multi-task learning approach},
  author={Mun-hak Lee and Joon-Hyuk Chang},
  • Mun-hak Lee, Joon-Hyuk Chang
  • Published 20 October 2021
  • Computer Science, Engineering
  • ArXiv
The remarkable performance of the pre-trained language model (LM) using self-supervised learning has led to a major paradigm shift in the study of natural language processing. In line with these changes, leveraging the performance of speech recognition systems with massive deep learning-based LMs is a major topic of speech recognition research. Among the various methods of applying LMs to speech recognition systems, in this paper, we focus on a cross-modal knowledge distillation method that… 

Figures and Tables from this paper


Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition
This paper proposes a knowledge distillation based training approach to integrating external language models into a sequence-to-sequence model, which achieves a character error rate of 9.3%, which is relatively reduced by 18.42% compared with the vanilla sequence- ToS model.
Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition
This work hypothesizes that using intermediate representations as auxiliary supervision at lower levels of deep networks may be a good way of combining the advantages of end-to-end training and more traditional pipeline approaches.
Hierarchical Multitask Learning for CTC-based Speech Recognition
It is observed that the hierarchical multitask approach improves over standard multitask training in higher-data experiments, while in the low-resource settings standard multitasks training works well.
A Fully Differentiable Beam Search Decoder
It is shown that it is possible to discriminatively train an acoustic model jointly with an explicit and possibly pre-trained language model, while implicitly learning a language model.
End-to-end attention-based large vocabulary speech recognition
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Two Efficient Lattice Rescoring Methods Using Recurrent Neural Network Language Models
Two efficient lattice rescoring methods for RNNLMs are proposed and produced 1-best performance comparable to a 10 k-best rescoring baseline RNNLM system on two large vocabulary conversational telephone speech recognition tasks for US English and Mandarin Chinese.
Speech recognition with deep recurrent neural networks
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Distilling the Knowledge of BERT for Sequence-to-Sequence ASR
This work leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation, and outperforms other LM application approaches such as n-best rescoring and shallow fusion, while it does not require extra inference cost.
Distilling the Knowledge in a Neural Network
This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.