Cross-Lingual Transfer for Speech Processing Using Acoustic Language Similarity

@article{wu2021crosslingual,
  title={Cross-Lingual Transfer for Speech Processing Using Acoustic Language Similarity},
  author={Peter Wu and Jiatong Shi and Yifan Zhong and Shinji Watanabe and Alan W. Black},
  journal={2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2021}
}

  • Published 2 November 2021
  • Computer Science, Linguistics
Speech processing systems currently do not support the vast majority of languages, in part due to the lack of data in low-resource languages. Cross-lingual transfer offers a compelling way to help bridge this digital divide by incorporating high-resource data into low-resource systems. Current cross-lingual algorithms have shown success in text-based tasks and in speech-related tasks for some low-resource languages. However, scaling up speech systems to support hundreds of low-resource languages… 
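
To make the idea in the title concrete — picking high-resource donor languages by how acoustically similar they are to a low-resource target — here is a minimal sketch. The feature vectors and language codes below are hypothetical illustrations, not data or methods from the paper:

```python
# Sketch: rank candidate high-resource "donor" languages by cosine
# similarity of (hypothetical) acoustic embedding vectors to a
# low-resource target language. All vectors are made up for illustration.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def rank_donors(target_vec, donor_vecs):
    """Return donor language codes sorted most-similar-first."""
    return sorted(donor_vecs,
                  key=lambda code: cosine(target_vec, donor_vecs[code]),
                  reverse=True)

# Hypothetical 3-dimensional acoustic embeddings.
target = [0.9, 0.1, 0.2]
donors = {"es": [0.8, 0.2, 0.1], "zh": [0.1, 0.9, 0.3], "de": [0.4, 0.5, 0.5]}
print(rank_donors(target, donors))  # ['es', 'de', 'zh']
```

In practice such embeddings would come from a trained acoustic model rather than hand-picked numbers; the ranking step itself stays this simple.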


When Is TTS Augmentation Through a Pivot Language Useful?
This work proposes an alternative: produce synthetic audio by running text from the target language through a trained TTS system for a higher-resource pivot language, and investigates when and how this technique is most effective in low-resource settings.


Sequence-Based Multi-Lingual Low Resource Speech Recognition
It is shown that end-to-end multi-lingual training of sequence models is effective on context-independent models trained using Connectionist Temporal Classification (CTC) loss, and that such models can be adapted cross-lingually to an unseen language using just 25% of the target data.

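
Since this entry hinges on CTC loss, a minimal pure-Python sketch of the CTC forward (alpha) recursion may help make the objective concrete. This is a from-scratch illustration, not code from the cited work; real systems use an optimized implementation such as torch.nn.CTCLoss:

```python
# Sketch: probability of a label sequence under CTC, via the standard
# forward (alpha) recursion over the blank-extended label sequence.

def ctc_prob(frame_probs, target, blank=0):
    """frame_probs: per-frame symbol distributions; target: label ids."""
    # Interleave blanks: y1 y2 -> b y1 b y2 b
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(frame_probs)
    # Initialize: a path may start with a blank or the first label.
    alpha = [0.0] * S
    alpha[0] = frame_probs[0][blank]
    if S > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for t in range(1, T):
        prev, alpha = alpha, [0.0] * S
        for s in range(S):
            a = prev[s]                      # stay on the same symbol
            if s > 0:
                a += prev[s - 1]             # advance one position
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += prev[s - 2]             # skip the blank between distinct labels
            alpha[s] = a * frame_probs[t][ext[s]]
    # Valid paths end on the last label or the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# Two frames, symbols {0: blank, 1: 'a'}, uniform per-frame probabilities.
# Paths collapsing to "a": (a,a), (a,-), (-,a) -> 3 * 0.25 = 0.75
print(ctc_prob([[0.5, 0.5], [0.5, 0.5]], [1]))  # 0.75
```

Production implementations work in log space for numerical stability; the recursion structure is the same.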
Unsupervised Cross-lingual Representation Learning for Speech Recognition
XLSR is presented, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages, enabling a single multilingual speech recognition model that is competitive with strong individual models.

Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning
A large-scale end-to-end language-independent multilingual model for joint automatic speech recognition (ASR) and language identification (LID) is reported, achieving a word error rate (WER) of 52.8 and LID accuracy of 93.5 on 42 languages with around 5000 hours of training data.

Using resources from a closely-related language to develop ASR for a very under-resourced language: a case study for iban
A semi-supervised method for building the pronunciation dictionary, combined with cross-lingual strategies for improving acoustic models trained on very limited data, produced very encouraging results, showing that data from a closely-related language, if available, can be exploited to build ASR for a new language.

Multilingual Speech Recognition with a Single End-to-End Model
This model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually, improves performance by an additional 7% relative, and eliminates confusion between different languages.

Common Voice: A Massively-Multilingual Speech Corpus
This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages; for most of these languages, these are the first published end-to-end Automatic Speech Recognition results.

Language independent end-to-end architecture for joint language identification and speech recognition
This paper presents a model that can recognize speech in 10 different languages by directly performing grapheme (character/chunked-character) based speech recognition, built on the hybrid attention/connectionist temporal classification (CTC) architecture.

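
The hybrid attention/CTC architecture mentioned here trains both objectives jointly as a weighted sum. A minimal sketch of that interpolation follows; the weight value and the loss numbers are illustrative assumptions, not figures from the cited paper:

```python
# Sketch: the multi-task objective of hybrid attention/CTC training,
# L = lam * L_ctc + (1 - lam) * L_att. The default weight is illustrative.

def hybrid_loss(ctc_loss, attention_loss, lam=0.3):
    """Interpolate CTC and attention losses; lam must lie in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# Illustrative per-batch loss values: 0.3 * 2.0 + 0.7 * 1.0 = 1.3
print(hybrid_loss(2.0, 1.0))
```

The CTC branch enforces monotonic alignment while the attention branch models output dependencies; the interpolation lets training benefit from both.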
SIGTYP 2021 Shared Task: Robust Spoken Language Identification
Domain and speaker mismatch proves very challenging for current methods, which can perform above 95% accuracy in-domain; domain adaptation can address this to some degree, but these conditions merit further investigation to make spoken language identification accessible in many scenarios.

Cross-Lingual Natural Language Generation via Pre-Training
Experimental results on question generation and abstractive summarization show that the model outperforms the machine-translation-based pipeline methods for zero-shot cross-lingual generation and improves NLG performance of low-resource languages by leveraging rich-resource language data.

Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
It is shown that multilingual training of ASR models on several languages can improve recognition performance, in particular on low-resource languages.