Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages

  title={Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages},
  author={Jiyeon Kim and Mehul Kumar and Dhananjaya N. Gowda and Abhinav Garg and Chanwoo Kim},
  journal={2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  • Jiyeon Kim, Mehul Kumar, Chanwoo Kim
  • Published 19 November 2021
  • Computer Science
  • 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques such as transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR, we leverage a well-trained English model, unlabeled text corpus, and unlabeled audio corpus using transfer learning, TTS… 
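Two of the techniques named in the abstract, transfer learning from a trained English model and encoder freezing, can be illustrated with a minimal sketch. The abstract does not specify the paper's architecture, stage schedule, or optimizer, so everything below is an assumption made for illustration: the parameter names (`encoder.layer0`, `decoder.out`), the scalar weights, the gradients, and the plain SGD rule are all hypothetical.

```python
# Illustrative sketch only: parameter names, values, gradients, and the update
# rule are hypothetical and are not taken from the paper.

def init_from_source(source_params):
    """Transfer learning: start the target model from a copy of the source weights."""
    return dict(source_params)

def trainable_names(params, frozen_prefixes):
    """Return the names of parameters that do not fall under a frozen prefix."""
    prefixes = tuple(frozen_prefixes)
    return [name for name in params if not name.startswith(prefixes)]

def sgd_step(params, grads, frozen_prefixes, lr=0.1):
    """Apply one SGD update, leaving every frozen parameter unchanged."""
    updated = dict(params)
    for name in trainable_names(params, frozen_prefixes):
        updated[name] = params[name] - lr * grads[name]
    return updated

# Hypothetical "English" source model with scalar stand-ins for weight tensors.
english_params = {"encoder.layer0": 0.5, "encoder.layer1": -0.2, "decoder.out": 1.0}

# Initialise the "Italian" target model from the source, then update it with
# the encoder frozen so that only the decoder weights move.
italian_params = init_from_source(english_params)
grads = {"encoder.layer0": 1.0, "encoder.layer1": 1.0, "decoder.out": 1.0}
italian_params = sgd_step(italian_params, grads, frozen_prefixes=["encoder."])
```

In this toy setup the encoder entries keep their source values while `decoder.out` is updated. A real system would apply the same idea to tensors inside a deep-learning framework rather than to a dictionary of floats.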

Semi-supervised learning for speech recognition in the context of accent adaptation
This paper experiments with cross-entropy based speaker selection to adapt a source recognizer to a target accent in a semi-supervised manner, using additional data with no accent labels, and obtains significant improvements over the baseline by leveraging additional unlabeled data on two different tasks in Arabic and English.
A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition
This paper compares a suite of past methods and some of their own proposed methods for using unpaired text data to improve encoder-decoder models, and results confirm the benefits of using unpaired text across a range of methods and data sets.
Speech Model Pre-training for End-to-End Spoken Language Understanding
This paper proposes a method to reduce the data requirements of end-to-end SLU: the model is first pre-trained to predict words and phonemes, thereby learning good features for SLU. The method improves performance both when the full dataset is used for training and when only a small subset is used.
Cross-Language End-to-End Speech Recognition Research Based on Transfer Learning for the Low-Resource Tujia Language
This paper studied an end-to-end speech recognition model based on sample transfer learning for the low-resource Tujia language, and showed that the recognition error rate of the proposed model is 2.11% lower than that of the model that only used the Tujia language data for training.
Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
This paper explores how the current speech synthesis technology can be leveraged to tailor the ASR system for a target domain by preparing only a relevant text corpus and generates speech features using a sequence-to-sequence speech synthesizer.
Transfer Learning for Speech Recognition on a Budget
This work conducts several systematic experiments adapting a Wav2Letter convolutional neural network originally trained for English ASR to the German language, showing that this technique allows faster training on consumer-grade resources while requiring less training data in order to achieve the same accuracy.
Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition
The proposed utterance invariant training combines three different types of conditioning, namely concatenative, multiplicative, and additive, and shows word error rate reductions of up to 7% relative on LibriSpeech and 10-15% on a large-scale Korean end-to-end two-pass hybrid ASR model.
Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
This work proposes a new semi-supervised loss combining an end-to-end differentiable ASR loss that is able to leverage both unpaired speech and text data to outperform recently proposed related techniques in terms of %WER.
Semi-supervised training in low-resource ASR and KWS
A set of experiments on telephony-quality speech in the low-resource languages Assamese, Bengali, Lao, Haitian, Zulu, and Tamil is presented, demonstrating the impact that semi-supervised training and speaker adaptation techniques can have, in particular for learning robust bottleneck features on the test data.
End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System
The authors' end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on test-clean of the LibriSpeech test set after applying shallow fusion with a Transformer language model (LM).