Pseudo-Labeling for Massively Multilingual Speech Recognition

@inproceedings{Lugosch2022PseudoLabelingFM,
  title={Pseudo-Labeling for Massively Multilingual Speech Recognition},
  author={Loren Lugosch and Tatiana Likhomanenko and Gabriel Synnaeve and Ronan Collobert},
  booktitle={ICASSP},
  year={2022}
}
Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised learning on a target language, generate pseudo-labels for that language, and train a final model using pseudo-labels for all languages, either from scratch or by fine-tuning.
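
The recipe reads as a four-step procedure over languages. The sketch below is a schematic Python rendering of it under stated assumptions, not the authors' implementation: the helpers train_model, fine_tune, and transcribe are illustrative stubs standing in for real acoustic-model training and decoding, and datasets are represented as plain lists of (audio, transcript) pairs.

# Sketch of the four-step pseudo-labeling recipe from the abstract.
# All helpers here are illustrative stubs, not the paper's training code.

def train_model(data):
    # Stand-in for supervised acoustic-model training on (audio, text) pairs.
    return {"trained_on": len(data)}

def fine_tune(model, labeled, unlabeled):
    # Stand-in for semi-supervised fine-tuning on one target language.
    return {"base": model, "labeled": len(labeled), "unlabeled": len(unlabeled)}

def transcribe(model, audio_batch):
    # Stand-in for decoding unlabeled audio into pseudo-labels.
    return [(audio, "<pseudo-label>") for audio in audio_batch]

def pseudo_label_recipe(labeled, unlabeled, languages):
    # Step 1: train a supervised multilingual model on pooled labeled data.
    pooled = [ex for lang in languages for ex in labeled[lang]]
    base = train_model(pooled)

    # Steps 2-3: per-language semi-supervised fine-tuning, then decode that
    # language's unlabeled audio with the fine-tuned model to get pseudo-labels.
    pseudo = {}
    for lang in languages:
        lang_model = fine_tune(base, labeled[lang], unlabeled[lang])
        pseudo[lang] = transcribe(lang_model, unlabeled[lang])

    # Step 4: train a final model on labeled + pseudo-labeled data for all
    # languages; the paper allows training from scratch or fine-tuning.
    final_data = pooled + [ex for lang in languages for ex in pseudo[lang]]
    return train_model(final_data)

# Toy usage with dummy data:
langs = ["en", "fr"]
labeled = {l: [(f"{l}_clip{i}.wav", "text") for i in range(3)] for l in langs}
unlabeled = {l: [f"{l}_raw{i}.wav" for i in range(5)] for l in langs}
final_model = pseudo_label_recipe(labeled, unlabeled, langs)

Note the design choice the abstract describes: pseudo-labels for each language are generated by a model fine-tuned on that language (steps 2-3), rather than by the shared multilingual model directly.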
