• Corpus ID: 238583266

Injecting Text and Cross-lingual Supervision in Few-shot Learning from Self-Supervised Models

  title={Injecting Text and Cross-lingual Supervision in Few-shot Learning from Self-Supervised Models},
  author={Matthew Wiesner and Desh Raj and Sanjeev Khudanpur},
Self-supervised model pre-training has recently garnered significant interest, but relatively few efforts have explored using additional resources in fine-tuning these models. We demonstrate how universal phoneset acoustic models can leverage cross-lingual supervision to improve transfer of pretrained self-supervised representations to new languages. We also show how target-language text can be used to enable and improve fine-tuning with the lattice-free maximum mutual information (LF-MMI) …


PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and fine-tunes subnetworks for much better ASR performance while requiring only a single downstream fine-tuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.
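As a rough illustration only (not the authors' implementation), the prune-adjust-re-prune idea can be sketched with magnitude pruning in NumPy; `finetune_step` below is a hypothetical stand-in for one downstream fine-tuning update:

```python
import numpy as np

def magnitude_mask(w, sparsity):
    """Mask keeping the largest-magnitude (1 - sparsity) fraction of weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return np.ones_like(w, dtype=bool)
    thresh = np.sort(np.abs(w).ravel())[k - 1]
    return np.abs(w) > thresh

def parp_sketch(w, sparsity, steps, finetune_step):
    """Prune once, then alternate: adjust all weights (previously pruned
    entries may revive) and re-prune by recomputing the magnitude mask."""
    mask = magnitude_mask(w, sparsity)
    for _ in range(steps):
        w = finetune_step(w * mask)          # in a real setup, gradients reach all entries
        mask = magnitude_mask(w, sparsity)   # re-prune: the mask is allowed to change
    return w * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
pruned, mask = parp_sketch(w, 0.5, 3, lambda w: w + 0.01 * rng.standard_normal(w.shape))
```

The key point the sketch captures is that the subnetwork is not frozen after the first pruning pass: re-computing the mask lets the sparsity pattern adapt during fine-tuning.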


HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training?
The Hidden-Unit BERT (HuBERT) model is proposed, which uses a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model, allowing the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality.
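A minimal sketch of the clustering step described above, assuming nothing beyond NumPy: frame-level features are clustered with a small hand-rolled k-means, and the cluster assignments serve as the discrete per-frame targets (the feature dimensions and cluster count here are illustrative, not HuBERT's actual configuration):

```python
import numpy as np

def kmeans_targets(feats, k, iters=20, seed=0):
    """Cluster frame-level features and return one discrete label per frame,
    mimicking the use of k-means assignments as pre-training targets."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each frame to its nearest centroid
        d = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # update centroids, keeping the old one if a cluster empties
        for c in range(k):
            if (labels == c).any():
                centroids[c] = feats[labels == c].mean(axis=0)
    return labels

# e.g. 200 frames of 13-dim MFCC-like features -> 200 pseudo-labels in [0, 5)
feats = np.random.default_rng(1).standard_normal((200, 13))
labels = kmeans_targets(feats, k=5)
```

Even when the clustering is crude (a "bad teacher"), the targets are consistent across epochs, which is what the pre-training objective exploits.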
Semi-Supervised end-to-end Speech Recognition via Local Prior Matching
This work proposes local prior matching (LPM), a semi-supervised objective that distills knowledge from a strong prior to provide a learning signal to an end-to-end model trained on unlabeled speech, and demonstrates that LPM is simple to implement and superior to existing knowledge distillation techniques under comparable settings.
Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition
This work first exploits a large amount of unlabeled audio data via representation learning, reconstructing a temporal slice of filterbank features from past and future context frames, and then uses these representations to train a CTC-based end-to-end ASR system on a smaller amount of labeled audio data.
Investigating Self-Supervised Pre-Training for End-to-End Speech Translation
It is shown that self-supervised pre-training is particularly effective in low-resource settings, that fine-tuning CPC models on the AST training data further improves performance, and that ensembling AST models trained with filter-bank and CPC representations leads to near state-of-the-art models without using any ASR pre-training.
Sequence-Based Multi-Lingual Low Resource Speech Recognition
It is shown that end-to-end multi-lingual training of sequence models is effective on context-independent models trained with Connectionist Temporal Classification (CTC) loss, and that these models can be adapted cross-lingually to an unseen language using just 25% of the target-language data.
Unsupervised Pretraining Transfers Well Across Languages
It is shown that a slight modification of CPC pretraining extracts features that transfer well to other languages, on par with or even outperforming supervised pretraining, which demonstrates the potential of unsupervised methods for languages with few linguistic resources.
Injecting Text in Self-Supervised Speech Pretraining
The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from both untranscribed speech and unspoken text.
Self-Training for End-to-End Speech Recognition
  • Jacob Kahn, Ann Lee, Awni Y. Hannun
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This work revisits self-training in the context of end-to-end speech recognition and demonstrates that training with pseudo-labels can substantially improve the accuracy of a baseline model.
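The core pseudo-labeling loop can be sketched as follows; this is a hedged toy, with a nearest-centroid classifier and a distance-margin confidence standing in for the seed ASR model and its beam-search scores:

```python
import numpy as np

def pseudo_label(model_centroids, unlabeled, threshold):
    """Label unlabeled points with the seed model; keep only confident ones.
    Confidence here is a toy margin between the two nearest class centroids."""
    kept_x, kept_y = [], []
    for x in unlabeled:
        d = np.linalg.norm(model_centroids - x, axis=1)
        order = np.argsort(d)
        margin = d[order[1]] - d[order[0]]   # gap to the runner-up class
        if margin >= threshold:
            kept_x.append(x)
            kept_y.append(int(order[0]))
    return np.array(kept_x), np.array(kept_y)

# seed "model": one centroid per class, fit on a small labeled set
centroids = np.array([[-2.0, 0.0], [2.0, 0.0]])
unlabeled = np.array([[-2.1, 0.1], [1.9, -0.2], [0.05, 0.0]])  # last point is ambiguous
x_new, y_new = pseudo_label(centroids, unlabeled, threshold=1.0)
```

The confidently labeled pairs `(x_new, y_new)` would then be mixed into the supervised set for another round of training; filtering by confidence is what keeps noisy pseudo-labels from dominating.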
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
This paper shows that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73% and improves generalization performance on domains not seen during training.
Language-invariant Bottleneck Features from Adversarial End-to-end Acoustic Models for Low Resource Speech Recognition
  • Jiangyan Yi, J. Tao, Ye Bai
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The results show that a target model trained with the proposed language-invariant bottleneck features outperforms one trained with conventional multilingual bottleneck features by up to a 9.7% relative word error rate reduction.
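Adversarial training of this kind typically relies on gradient reversal between the shared encoder and a language classifier. A minimal numerical sketch (assumed mechanism, not this paper's exact recipe; `lam` is the usual reversal-strength hyperparameter):

```python
import numpy as np

def grad_reversal_backward(grad_from_classifier, lam=1.0):
    """Gradient reversal: identity in the forward pass, but the gradient
    flowing back into the shared encoder is scaled by -lambda, pushing the
    encoder toward language-invariant features."""
    return -lam * grad_from_classifier

# toy: encoder update combines the task (ASR) gradient with the
# reversed language-ID gradient from the adversarial branch
g_asr = np.array([0.2, -0.1])
g_lang = np.array([0.5, 0.3])
g_encoder = g_asr + grad_reversal_backward(g_lang, lam=0.5)
```

Flipping the sign means the encoder is updated to make the language classifier *worse*, which is what removes language identity from the bottleneck features.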