Semi-Supervised Speech Recognition via Local Prior Matching
@article{Hsu2020SemiSupervisedSR,
  title   = {Semi-Supervised Speech Recognition via Local Prior Matching},
  author  = {Wei-Ning Hsu and Ann Lee and Gabriel Synnaeve and Awni Y. Hannun},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2002.10336}
}
For sequence transduction tasks like speech recognition, a strong structured prior model encodes rich information about the target space, implicitly ruling out invalid sequences by assigning them low probability. In this work, we propose local prior matching (LPM), a semi-supervised objective that distills knowledge from a strong prior (e.g. a language model) to provide learning signal to a discriminative model trained on unlabeled speech. We demonstrate that LPM is theoretically well-motivated…
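The abstract is truncated above, but the core mechanism is describable. As a rough illustration of the idea, and not the authors' released implementation, LPM can be sketched as: propose hypotheses for each unlabeled utterance with beam search, renormalize their language-model scores into a local prior over the beam, and train the ASR model to match that prior. In the sketch below, `beam_search`, `lm_score`, and `asr_model.log_prob` are assumed interfaces.

```python
import torch
import torch.nn.functional as F

def local_prior_matching_loss(asr_model, lm_score, beam_search, speech_batch):
    """Hedged sketch of the LPM objective on a batch of unlabeled speech.

    For each utterance x: propose hypotheses with beam search, renormalize
    their language-model scores into a local prior q over the beam, and
    train the ASR model to match q with its hypothesis log-probabilities.
    `beam_search`, `lm_score`, and `asr_model.log_prob` are assumed
    interfaces, not the paper's actual API.
    """
    total = 0.0
    for x in speech_batch:
        hyps = beam_search(asr_model, x, beam_size=8)       # proposal set
        lm_logp = torch.stack([lm_score(y) for y in hyps])  # log p_LM(y)
        q = F.softmax(lm_logp, dim=0).detach()              # local prior over the beam
        asr_logp = torch.stack([asr_model.log_prob(x, y) for y in hyps])
        total = total - torch.sum(q * asr_logp)             # cross-entropy to the prior
    return total / len(speech_batch)
```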
19 Citations
Large scale weakly and semi-supervised learning for low-resource video ASR
- Computer Science · INTERSPEECH
- 2020
A large scale systematic comparison between two self-labeling methods, and weakly-supervised pretraining using contextual metadata on the challenging task of transcribing social media videos in low-resource conditions is conducted.
Semi-Supervised Speech Recognition Via Graph-Based Temporal Classification
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
Results show that this approach can effectively exploit an N-best list of pseudo-labels with associated scores, considerably outperforming standard pseudo-labeling, with ASR results approaching an oracle experiment in which the best hypotheses of the N-best lists are selected manually.
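The actual GTC loss marginalizes over a label graph built from the N-best list; as a much simpler stand-in, the sketch below only score-weights independent CTC losses over the N hypotheses. `ctc_loss_fn`, `log_probs`, and the input shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def nbest_weighted_loss(ctc_loss_fn, log_probs, nbest, scores):
    """Simplified stand-in for training on an N-best list of pseudo-labels.

    `nbest` is a list of pseudo-label sequences for one utterance and
    `scores` their decoder scores; the losses are mixed with softmax
    weights rather than committing to the single 1-best hypothesis.
    """
    weights = F.softmax(torch.tensor(scores), dim=0)
    losses = torch.stack([ctc_loss_fn(log_probs, y) for y in nbest])
    return torch.sum(weights * losses)
```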
Unsupervised Speech Recognition
- Computer Science · NeurIPS
- 2021
Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
Towards Semi-Supervised Semantics Understanding from Speech
- Computer Science · ArXiv
- 2020
Experiments show that the proposed SLU framework with speech as input can perform on par with systems given oracle text as input for semantics understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available.
Iterative Pseudo-Labeling for Speech Recognition
- Computer Science · INTERSPEECH
- 2020
This work studies Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm that efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves, and demonstrates the effectiveness of IPL by achieving state-of-the-art word error rates on the LibriSpeech test sets.
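A minimal sketch of that recipe, with all helper names (`decode_with_lm`, `train_epoch`) assumed rather than taken from the paper's code:

```python
def iterative_pseudo_labeling(model, labeled, unlabeled,
                              decode_with_lm, train_epoch, rounds=5):
    """Hedged sketch of the IPL loop: alternate pseudo-labeling and training.

    Each round re-decodes the unlabeled audio with the *current* model
    (plus an external LM) and continues training on the union of the
    labeled set and the fresh pseudo-labels, without restarting.
    """
    for _ in range(rounds):
        pseudo = [(x, decode_with_lm(model, x)) for x in unlabeled]
        train_epoch(model, labeled + pseudo)  # continue from current weights
    return model
```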
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
- Computer Science · Interspeech
- 2021
Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. We…
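The snippet above is truncated, but the defining detail suggested by the title is that the model generating the pseudo-labels is a momentum (exponential moving average) copy of the trained student. A minimal sketch of that update, under that assumption:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """EMA teacher update as used in momentum-style pseudo-labeling (sketch).

    The teacher that produces pseudo-labels tracks an exponential moving
    average of the student's weights, which stabilizes the labels as the
    student evolves. The momentum value here is illustrative.
    """
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```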
Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution
- Computer Science · INTERSPEECH
- 2020
A novel T/S learning scheme with conditional posterior distribution for encoder-decoder based ASR is proposed, which reduces WER by 19.2% relative on the LibriSpeech benchmark compared with a system trained using only paired data.
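In generic teacher/student terms (not necessarily the paper's exact formulation), matching a conditional posterior amounts to a token-level KL distillation loss; the temperature `T` below is an assumption of this sketch:

```python
import torch.nn.functional as F

def ts_distillation_loss(student_logits, teacher_logits, T=1.0):
    """Sketch of token-level T/S distillation: the student matches the
    teacher's conditional posterior over output tokens via KL divergence.
    Logits have shape (batch * steps, vocab); T is a softening temperature.
    """
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```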
Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This work proposes DUST, a dropout-based uncertainty-driven self-training technique that measures the model's uncertainty about a prediction via the agreement between multiple predictions of the ASR system obtained under different dropout settings.
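A hedged sketch of that filtering rule follows; `decode(model, x)` is an assumed helper returning a token sequence, and the threshold is illustrative:

```python
import torch

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[-1]

@torch.no_grad()
def dust_filter(model, x, decode, n_samples=3, max_ratio=0.3):
    """Sketch of DUST-style filtering via dropout disagreement.

    Decode once with dropout off as the reference, then several times
    with dropout on; if any sampled hypothesis drifts too far from the
    reference (normalized edit distance), reject the utterance as too
    uncertain to pseudo-label.
    """
    model.eval()                  # dropout off: reference hypothesis
    ref = decode(model, x)
    model.train()                 # dropout on: sampled hypotheses
    for _ in range(n_samples):
        hyp = decode(model, x)
        if edit_distance(hyp, ref) / max(len(ref), 1) > max_ratio:
            return None           # reject: model is uncertain here
    return ref                    # accept reference as pseudo-label
```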
Joint Masked CPC And CTC Training For ASR
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This paper demonstrates a single-stage training of ASR models that can utilize both unlabeled and labeled data and postulates that solving the contrastive task is a regularization for the supervised CTC loss.
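The single-stage recipe can be summarized as one training step mixing both objectives; the model methods and the `alpha` weighting below are assumptions of this sketch, not the paper's API:

```python
def joint_masked_cpc_ctc_step(model, labeled_batch, unlabeled_batch, alpha=0.5):
    """One training step of the joint objective (sketch).

    The contrastive (masked-CPC-style) loss is computed on unlabeled audio
    and acts as a regularizer for the supervised CTC loss on labeled audio.
    """
    l_ctc = model.ctc_loss(*labeled_batch)           # supervised branch
    l_cpc = model.contrastive_loss(unlabeled_batch)  # self-supervised branch
    return l_ctc + alpha * l_cpc
```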
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
- Computer Science · IEEE Journal of Selected Topics in Signal Processing
- 2022
It is shown that the combination of pretraining, self-training, and scaling up model size greatly improves data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data, while also obtaining state-of-the-art performance on many public benchmarks.
References
Showing 1-10 of 57 references
Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation
- Computer Science · 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2017
This paper addresses the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are available, but word transcripts are only available for the source domain speech.
Self-Training for End-to-End Speech Recognition
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work revisits self-training in the context of end-to-end speech recognition and demonstrates that training with pseudo-labels can substantially improve the accuracy of a baseline model.
Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
- Computer Science · INTERSPEECH
- 2019
This work proposes a new semi-supervised loss for end-to-end differentiable ASR that is able to leverage both unpaired speech and text data, outperforming recently proposed related techniques in terms of WER.
Semi-Supervised Training of Acoustic Models Using Lattice-Free MMI
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This work describes various extensions to standard LF-MMI training that allow lattices obtained by decoding unsupervised data to be used as supervision, and investigates different methods for splitting the lattices and incorporating frame tolerances into the supervision FST.
An unsupervised deep domain adaptation approach for robust speech recognition
- Computer Science · Neurocomputing
- 2017
Semi-supervised Training for End-to-end Models via Weak Distillation
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
A Part-of-Speech (POS) tagger is adopted to filter the unsupervised data, keeping only utterances that contain proper nouns; training with the filtered unsupervised data provides up to a 13% relative reduction in word error rate (WER), and up to a 17% relative improvement when used in conjunction with a cold-fusion RNN-LM.
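One plausible way to implement such a proper-noun filter, using NLTK as a stand-in for whatever tagger the authors used (assumes the `punkt` and `averaged_perceptron_tagger` resources are downloaded):

```python
import nltk  # assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

def keep_if_proper_noun(transcript: str) -> bool:
    """Sketch of the proper-noun filter described above: keep a
    pseudo-labeled utterance only if its transcript contains a proper
    noun (Penn Treebank tags NNP or NNPS).
    """
    tokens = nltk.word_tokenize(transcript)
    return any(tag in ("NNP", "NNPS") for _, tag in nltk.pos_tag(tokens))
```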
Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error
- Computer Science · IEEE Transactions on Audio, Speech, and Language Processing
- 2007
This article reports significant gains in recognition performance and model compactness from discriminative training based on minimum classification error (MCE) applied to HMMs, in the context of three challenging large-vocabulary speech recognition tasks.
Semi-Supervised End-to-End Speech Recognition
- Computer Science · INTERSPEECH
- 2018
We propose a novel semi-supervised method for end-to-end automatic speech recognition (ASR). It can exploit large unpaired speech and text datasets, which require much less human effort to create…
Libri-Light: A Benchmark for ASR with Limited or No Supervision
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
A new collection of spoken English audio, derived from open-source audio books from the LibriVox project, suitable for training speech recognition systems under limited or no supervision; to the authors' knowledge, it is the largest freely available corpus of speech.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
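For reference, a minimal illustration of multi-head attention over encoder states, using PyTorch's built-in module rather than the paper's exact LAS decoder; all shapes and sizes here are illustrative:

```python
import torch
import torch.nn as nn

# Multi-head attention: decoder queries attend to encoder states.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
enc = torch.randn(2, 100, 256)   # encoder states: (batch, time, dim)
dec = torch.randn(2, 10, 256)    # decoder queries: (batch, steps, dim)
context, attn_weights = mha(query=dec, key=enc, value=enc)
print(context.shape)             # torch.Size([2, 10, 256])
```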