• Corpus ID: 211258986

Semi-Supervised Speech Recognition via Local Prior Matching

  • Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Y. Hannun
For sequence transduction tasks like speech recognition, a strong structured prior model encodes rich information about the target space, implicitly ruling out invalid sequences by assigning them low probability. In this work, we propose local prior matching (LPM), a semi-supervised objective that distills knowledge from a strong prior (e.g. a language model) to provide learning signal to a discriminative model trained on unlabeled speech. We demonstrate that LPM is theoretically well-motivated… 
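The idea of matching a discriminative model to a prior over a small hypothesis set can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a beam of candidate transcripts has already been produced for one unlabeled utterance, that `lm_log_probs` and `model_log_probs` (hypothetical names) hold the language-model and speech-model log-scores for those candidates, and that both distributions are renormalized over the beam before a cross-entropy is taken.

```python
import math


def log_softmax(scores):
    """Normalize a list of log-scores so that their exponentials sum to 1."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def lpm_loss(lm_log_probs, model_log_probs):
    """Cross-entropy between a beam-normalized prior and model posterior.

    lm_log_probs:    LM log-scores of the beam hypotheses (assumed given).
    model_log_probs: speech-model log-scores of the same hypotheses.
    Renormalizing both over the beam is an assumption of this sketch.
    """
    if not lm_log_probs or len(lm_log_probs) != len(model_log_probs):
        raise ValueError("expected two non-empty lists of equal length")
    log_prior = log_softmax(lm_log_probs)
    log_posterior = log_softmax(model_log_probs)
    # The prior acts as a fixed soft target; only the posterior would be trained.
    return -sum(math.exp(lp) * lq for lp, lq in zip(log_prior, log_posterior))


# The loss is small when the model ranks hypotheses like the prior does,
# and large when it prefers a hypothesis the prior considers unlikely.
agree = lpm_loss([0.0, -10.0], [0.0, -10.0])
disagree = lpm_loss([0.0, -10.0], [-10.0, 0.0])
```

In a real system the posterior scores would come from a differentiable model so that the loss can be minimized by gradient descent; plain Python floats are used here only to keep the example self-contained.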
Large scale weakly and semi-supervised learning for low-resource video ASR
A large-scale systematic comparison of two self-labeling methods and weakly supervised pretraining using contextual metadata is conducted on the challenging task of transcribing social media videos in low-resource conditions.
Semi-Supervised Speech Recognition Via Graph-Based Temporal Classification
Results show that this approach can effectively exploit an N-best list of pseudo-labels with associated scores, considerably outperforming standard pseudo-labeling, with ASR results approaching an oracle experiment in which the best hypotheses of the N-best lists are selected manually.
Unsupervised Speech Recognition
Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3, and it rivals some of the best published systems trained on 960 hours of labeled data from only two years ago.
Towards Semi-Supervised Semantics Understanding from Speech
Experiments show that the proposed SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available.
Iterative Pseudo-Labeling for Speech Recognition
This work studies Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves, and demonstrates the effectiveness of IPL by achieving state-of-the-art word-error rate on the Librispeech test sets.
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. We
Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution
A novel T/S learning method with conditional posterior distribution for encoder-decoder based ASR is proposed, which yields a 19.2% relative WER reduction on the LibriSpeech benchmark compared with a system trained using only paired data.
Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training
DUST, a dropout-based uncertainty-driven self-training technique which uses agreement between multiple predictions of an ASR system obtained for different dropout settings to measure the model’s uncertainty about its prediction, is proposed.
Joint Masked CPC And CTC Training For ASR
This paper demonstrates a single-stage training of ASR models that can utilize both unlabeled and labeled data and postulates that solving the contrastive task is a regularization for the supervised CTC loss.
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
It is shown that the combination of pretraining, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data, and obtains SoTA performance on many public benchmarks.


Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation
This paper addresses the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are available, but word transcripts are only available for the source domain speech.
Self-Training for End-to-End Speech Recognition
  • Jacob Kahn, Ann Lee, Awni Y. Hannun
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
Self-training is revisited in the context of end-to-end speech recognition, and it is demonstrated that training with pseudo-labels can substantially improve the accuracy of a baseline model.
Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
This work proposes a new semi-supervised loss combining an end-to-end differentiable ASR loss that is able to leverage both unpaired speech and text data to outperform recently proposed related techniques in terms of %WER.
Semi-Supervised Training of Acoustic Models Using Lattice-Free MMI
Various extensions to standard LF-MMI training are described that allow lattices obtained by decoding unsupervised data to be used as supervision, and different methods for splitting the lattices and incorporating frame tolerances into the supervision FST are investigated.
Semi-supervised Training for End-to-end Models via Weak Distillation
A Part-of-Speech (POS) tagger is adopted to filter the unsupervised data to use only those with proper nouns, and it is shown that training with filtered unsupervised data provides up to a 13% relative reduction in word-error-rate (WER), and when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error
This article reports significant gains in recognition performance and model compactness as a result of discriminative training based on MCE training applied to HMMs, in the context of three challenging large-vocabulary speech recognition tasks.
Semi-Supervised End-to-End Speech Recognition
We propose a novel semi-supervised method for end-to-end automatic speech recognition (ASR). It can exploit large unpaired speech and text datasets, which require much less human effort to create
Libri-Light: A Benchmark for ASR with Limited or No Supervision
A new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision, derived from open-source audio books from the LibriVox project, which is, to the authors' knowledge, the largest freely-available corpus of speech.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, M. Bacchiani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly-used single-head attention.