Improved Consistency Training for Semi-Supervised Sequence-to-Sequence ASR via Speech Chain Reconstruction and Self-Transcribing

Heli Qi, Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura
Consistency regularization has recently been applied to semi-supervised sequence-to-sequence (S2S) automatic speech recognition (ASR). This principle encourages an ASR model to output similar predictions for the same input speech with different perturbations. The existing paradigm of semi-supervised S2S ASR utilizes SpecAugment as data augmentation and requires a static teacher model to produce pseudo transcripts for untranscribed speech. However, this paradigm fails to take full advantage of… 
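
The consistency principle described above can be illustrated with a minimal sketch. It is not the authors' method: the linear `toy_model`, the feature and vocabulary sizes, and the masking widths are all assumptions made for illustration, and a real S2S ASR model predicts token sequences autoregressively rather than frame by frame. The masking imitates only the time- and frequency-masking part of SpecAugment.

```python
import numpy as np

def mask_features(feats, rng, max_time=10, max_freq=4):
    """Zero one random time band and one random frequency band (SpecAugment-style masking)."""
    out = feats.copy()
    n_frames, n_bins = out.shape
    t_width = rng.integers(0, max_time + 1)
    t_start = rng.integers(0, n_frames - t_width + 1)
    out[t_start:t_start + t_width, :] = 0.0
    f_width = rng.integers(0, max_freq + 1)
    f_start = rng.integers(0, n_bins - f_width + 1)
    out[:, f_start:f_start + f_width] = 0.0
    return out

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def toy_model(feats, weights):
    """Hypothetical stand-in for an ASR model: per-frame log-probabilities over tokens."""
    return log_softmax(feats @ weights)

def consistency_loss(log_p, log_q):
    """Mean per-frame KL(p || q) between two predictions for the same utterance."""
    p = np.exp(log_p)
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))

rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 20))         # 50 frames x 20 filter-bank bins (made-up sizes)
weights = rng.standard_normal((20, 8)) * 0.1  # 8 output tokens (made-up size)

# Two different perturbations of the same input speech features.
view_a = mask_features(feats, rng)
view_b = mask_features(feats, rng)

# The loss is small when the two predictions agree and grows as they diverge.
loss = consistency_loss(toy_model(view_a, weights), toy_model(view_b, weights))
print(f"consistency loss: {loss:.4f}")
```

Minimizing such a term on untranscribed speech pushes the model toward predictions that are stable under perturbation. In practice one of the two branches is often treated as a fixed target, which this sketch does not model.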

Sequence-Level Consistency Training for Semi-Supervised End-to-End Automatic Speech Recognition
The experiments show that the proposed semi-supervised learning method with sequence-level consistency training can efficiently improve ASR performance using unlabeled speech data.
Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation
This work demonstrates the efficacy of two approaches to semi-supervision for automated speech recognition: factorized multilingual speech synthesis, which improves data augmentation on unspoken text, and the proposed Sequential MixMatch algorithm, which uses iterative learning to learn from untranscribed speech.
Semi-Supervised Learning with Data Augmentation for End-to-End ASR
This paper focuses on the consistency regularization principle and presents sequence-to-sequence versions of the FixMatch and Noisy Student algorithms, in which the pseudo labels for the unlabeled data are generated on the fly with a seq2seq model after perturbing the input features with data augmentation (DA).
Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution
A novel teacher-student (T/S) learning method with conditional posterior distribution is proposed for encoder-decoder-based ASR. It achieves a 19.2% relative WER reduction on the LibriSpeech benchmark compared with a system trained using only paired data.
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL…
Self-Training for End-to-End Speech Recognition
Jacob Kahn, Ann Lee, Awni Y. Hannun. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
This work revisits self-training in the context of end-to-end speech recognition and demonstrates that training with pseudo-labels can substantially improve the accuracy of a baseline model.
Machine Speech Chain
To the best of the authors' knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors. It significantly improved performance over that of separate systems trained only with labeled data.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
Improved Noisy Student Training for Automatic Speech Recognition
This work adapts and improves noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method and finding effective methods to filter, balance, and augment the data generated between self-training iterations.
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
The experimental results show that the ESPnet-TTS models can achieve state-of-the-art performance comparable to the other latest toolkits, resulting in a mean opinion score (MOS) of 4.25 on the LJSpeech dataset.