Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages
@article{Kim2021SemiSupervisedTL,
  title={Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages},
  author={Jiyeon Kim and Mehul Kumar and Dhananjaya N. Gowda and Abhinav Garg and Chanwoo Kim},
  journal={2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2021},
  pages={984-988}
}
In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques such as transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR, we leverage a well-trained English model, unlabeled text corpus, and unlabeled audio corpus using transfer learning, TTS…
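The transfer-learning stage described above hinges on freezing the encoder of a well-trained source-language (English) model while the remaining parameters are fine-tuned on the target language (Italian). A minimal NumPy sketch of that idea, with purely illustrative toy weights (the paper's actual encoder and decoder are large end-to-end neural ASR networks, and all names here are hypothetical):

```python
import numpy as np

# Toy sketch of transfer learning with encoder freezing. The linear
# "encoder"/"decoder" below stand in for the paper's neural ASR components.

rng = np.random.default_rng(0)

# Pretend these weights come from a well-trained source-language (English) model.
W_enc = rng.normal(size=(4, 4))   # encoder weights: frozen during transfer
W_dec = rng.normal(size=(4, 2))   # decoder weights: fine-tuned on target language

W_enc_before = W_enc.copy()
W_dec_before = W_dec.copy()

# One toy gradient step on target-language data: because the encoder is
# frozen, only the decoder weights receive an update.
x = rng.normal(size=(8, 4))       # stand-in acoustic features
y = rng.normal(size=(8, 2))       # stand-in training targets
lr = 0.1

h = np.tanh(x @ W_enc)            # frozen encoder produces representations
err = h @ W_dec - y               # decoder output error (squared-loss gradient)
W_dec -= lr * (h.T @ err) / len(x)  # update the decoder only; W_enc is untouched
```

After the step, the encoder weights are bit-identical to the source model's, while the decoder has adapted to the target data — the essence of encoder freezing during language transfer.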
References
Showing 1-10 of 31 references
Semi-supervised learning for speech recognition in the context of accent adaptation
- Computer Science, MLSLP
- 2012
This paper experiments with cross-entropy-based speaker selection to adapt a source recognizer to a target accent in a semi-supervised manner, using additional unlabeled data with no accent labels, and obtains significant improvements over the baseline on two different tasks in Arabic and English.
A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition
- Computer Science, 2018 IEEE Spoken Language Technology Workshop (SLT)
- 2018
This paper compares a suite of past methods and several newly proposed methods for using unpaired text data to improve encoder-decoder models; the results confirm the benefits of using unpaired text across a range of methods and data sets.
Speech Model Pre-training for End-to-End Spoken Language Understanding
- Computer Science, INTERSPEECH
- 2019
A method is proposed to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU; it improves performance both when the full dataset is used for training and when only a small subset is used.
Cross-Language End-to-End Speech Recognition Research Based on Transfer Learning for the Low-Resource Tujia Language
- Computer Science, Symmetry
- 2019
This paper studies an end-to-end speech recognition model based on sample transfer learning for the low-resource Tujia language, and shows that the recognition error rate of the proposed model is 2.11% lower than that of a model trained only on Tujia language data.
Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
- Computer Science, 2018 IEEE Spoken Language Technology Workshop (SLT)
- 2018
This paper explores how current speech synthesis technology can be leveraged to tailor an ASR system to a target domain by preparing only a relevant text corpus, generating speech features with a sequence-to-sequence speech synthesizer.
Transfer Learning for Speech Recognition on a Budget
- Computer Science, Rep4NLP@ACL
- 2017
This work conducts several systematic experiments adapting a Wav2Letter convolutional neural network originally trained for English ASR to the German language, showing that this technique allows faster training on consumer-grade resources while requiring less training data in order to achieve the same accuracy.
Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition
- Computer Science, INTERSPEECH
- 2020
The proposed utterance-invariant training combines three different types of conditioning, namely concatenative, multiplicative, and additive; it shows relative word error rate reductions of up to 7% on LibriSpeech and 10-15% on a large-scale Korean end-to-end two-pass hybrid ASR model.
Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
- Computer Science, INTERSPEECH
- 2019
This work proposes a new semi-supervised loss built around an end-to-end differentiable ASR loss that is able to leverage both unpaired speech and text data, outperforming recently proposed related techniques in terms of WER.
Semi-supervised training in low-resource ASR and KWS
- Computer Science, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2015
A set of experiments on low-resource telephony-quality speech in Assamese, Bengali, Lao, Haitian, Zulu, and Tamil is presented, demonstrating the impact that semi-supervised training and speaker adaptation techniques can have, in particular learning robust bottleneck features on the test data.
End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System
- Computer Science, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
The authors' end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on the LibriSpeech test-clean set after applying shallow fusion with a Transformer language model (LM).