Corpus ID: 218581886

Incremental Learning for End-to-End Automatic Speech Recognition

Li Fu, Xiaoxiao Li, Libo Zi
We propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR) that extends the model's capacity to a new task while retaining performance on existing ones. The method works without access to the old dataset, which addresses both the high cost of training and the unavailability of old data. To achieve this, knowledge distillation is applied as guidance to retain the recognition ability of the previous model, which is then combined with the new ASR task for…
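As a rough illustration of the idea in the abstract, the sketch below combines a new-task loss with a knowledge-distillation term that pulls the new model's outputs toward those of the previous model on the same new-task audio. It is only a minimal sketch under stated assumptions, not the paper's formulation: the abstract is truncated, so the exact way the two terms are combined is unknown, and the names `kd_weight`, `temperature`, `distillation_loss`, and `incremental_loss` are hypothetical. The new-task loss (for example a CTC loss) is assumed to be computed elsewhere and passed in as a number.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over one frame's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-frame KL(teacher || student) on softened distributions.

    teacher_logits: frames from the previous (frozen) model.
    student_logits: frames from the model being trained on the new task.
    """
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_frame, temperature)
        log_q = log_softmax(s_frame, temperature)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)

def incremental_loss(new_task_loss, teacher_logits, student_logits,
                     kd_weight=0.5, temperature=2.0):
    """Assumed combination: new-task loss plus a weighted distillation term."""
    kd = distillation_loss(teacher_logits, student_logits, temperature)
    return new_task_loss + kd_weight * kd

# Toy example: two frames, three output symbols (values are made up).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = incremental_loss(1.5, teacher, student)
```

When the student reproduces the teacher's outputs exactly, the distillation term is zero and only the new-task loss remains; the further the student drifts from the previous model, the larger the penalty. Some distillation setups also rescale the term by the squared temperature, which this sketch omits.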


SCaLa: Supervised Contrastive Learning for End-to-End Automatic Speech Recognition
  • Li Fu, Xiaoxiao Li, +4 authors Bowen Zhou
  • Computer Science, Engineering
  • ArXiv
  • 2021
A novel Supervised Contrastive Learning framework (SCaLa) is proposed to enhance phonemic information learning for end-to-end ASR systems; it can mitigate the noise of positive-negative pairs in self-supervised MCPC.
Towards Lifelong Learning of End-to-end ASR
This paper reports the first effort to extensively consider and analyze the use of various lifelong learning (LLL) approaches in end-to-end (E2E) ASR, including novel methods for saving data from past domains to mitigate catastrophic forgetting.


Domain Expansion in DNN-Based Acoustic Models for Robust Speech Recognition
This study examines several domain expansion techniques that exploit only the data of the new domain to build a stronger model for all domains, and evaluates them on an accent adaptation task in which a DNN acoustic model is adapted to three different English accents.
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
This work proposes a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable, and shows that even non-attention-based models learn to localize discriminative regions of the input image.
Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.
A continual learning survey: Defying forgetting in classification tasks.
This work focuses on task incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. It studies the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compares methods in terms of required memory, computation time and storage.
Knowledge Distillation: A Survey
A comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, distillation algorithms and applications is provided.
Continual Learning for Multi-Dialect Acoustic Models
This work demonstrates that by using loss functions that mitigate catastrophic forgetting, sequential transfer learning can be used to train multi-dialect acoustic models that narrow the WER gap between the best (combined training) and worst (fine-tuning) case by up to 65%.
Continual Learning in Automatic Speech Recognition
This work emulates continual learning observed in real life, where new training data are used for gradual improvement of an Automatic Speech Recognizer trained on old domains, and this appears to yield a slight advantage over offline multi-condition training.
Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning
Two unsupervised pre-training strategies, for the encoder and the decoder of the Transformer respectively, are proposed to make full use of unpaired data for training, and a new semi-supervised fine-tuning method named multi-task semantic knowledge learning is proposed to strengthen the Transformer’s ability to learn semantic knowledge, thereby improving system performance.
Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition
Experimental results show that a Mandarin transformer transducer using syllable with tone as the modeling unit achieves the best performance, and a new mix-bandwidth training method is presented to obtain a general model that can accurately recognize Mandarin speech at different sampling rates simultaneously.