• Corpus ID: 237635125

Simple and Effective Zero-shot Cross-lingual Phoneme Recognition

Qiantong Xu, Alexei Baevski, Michael Auli
Recent progress in self-training, self-supervised pretraining and unsupervised learning has enabled well-performing speech recognition systems without any labeled data. However, in many cases labeled data is available for related languages, and these methods do not use it. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the… 

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and fine-tunes subnetworks for much better ASR performance while requiring only a single downstream fine-tuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.


Self-Training and Pre-Training are Complementary for Speech Recognition
  • Qiantong Xu, Alexei Baevski, +5 authors Michael Auli
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
Pseudo-labeling and pre-training with wav2vec 2.0 are complementary across a variety of labeled-data setups for improving speech recognition systems with unlabeled data.
Sequence-Based Multi-Lingual Low Resource Speech Recognition
It is shown that end-to-end multi-lingual training of sequence models is effective on context independent models trained using Connectionist Temporal Classification (CTC) loss and can be adapted cross-lingually to an unseen language using just 25% of the target data.
Towards Zero-shot Learning for Automatic Phonemic Transcription
This model is able to recognize unseen phonemes in the target language without any training data and achieves 7.7% better phoneme error rate on average over a standard multilingual model.
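For readers unfamiliar with the metric, phoneme error rate (PER) is conventionally the edit distance between predicted and reference phoneme sequences, divided by the total reference length. The sketch below assumes that conventional definition; the summary above does not say how the cited work computes it, nor whether the 7.7% is an absolute or a relative difference. The function names are illustrative, not taken from any of the listed papers.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the first i-1 reference
    # phonemes and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Total edits over total reference phonemes (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference contains no phonemes")
    return total_edits / total_ref
```

Sequences are lists of phoneme symbols rather than raw strings, so that multi-character IPA symbols count as single units.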
Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding
Many existing languages are too sparsely resourced for monolingual deep learning networks to achieve high accuracy. Multilingual phonetic recognition systems mitigate data sparsity issues by training
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
This work proposes the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which uses an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
Universal Phone Recognition with a Multilingual Allophone System
This work proposes a joint model of both language-independent phone and language-dependent phoneme distributions that can be used to build a (nearly-)universal phone recognizer, which, when combined with PHOIBLE [1], a large, manually curated database of phone inventories, can be customized into 2,000 language-dependent recognizers.
Differentiable Allophone Graphs for Language-Universal Speech Recognition
This work presents a general framework for deriving phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights, represented using weighted finite-state transducers called differentiable allophone graphs.
Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language
Through finer-grained analysis, it is shown that training on even just a single related language gives the largest gain in word discrimination and query-by-example search evaluations, and that adding data from unrelated languages generally doesn’t hurt performance.
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
The algorithm uses a Gumbel-Softmax or online k-means clustering to quantize the dense representations, and experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors
It is shown that phonological features outperform character-based models when using PanPhon, a database relating over 5,000 IPA segments to 21 subsegmental articulatory features, and that this database boosts performance in various NER-related tasks.