BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

  • Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu
  • IEEE Journal of Selected Topics in Signal Processing
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion…

Robust Speech Recognition via Large-Scale Weak Supervision

The capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet are studied, with results that are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning.

Pseudo Label Is Better Than Human Label

This paper shows that a strong teacher model can be trained to produce high-quality pseudo labels by utilizing recent self-supervised and semi-supervised learning techniques, and can achieve a 13.6% relative WER reduction for a streaming model compared to using human labels.

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

A new pre-trained model, WavLM, is proposed, to solve full-stack downstream speech tasks and achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

XLS-R is presented, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0 that improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average.

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better ASR performance, while only requiring a single downstream finetuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.

Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR

This work proposes a pre-training approach called wav2vec-S, which uses task-specific semi-supervised pre-training to refine the self-supervised pre-trained model for the ASR task, thus more effectively utilizing the capacity of the pre-trained model to generate task-specific representations for ASR.

Multilingual Speech Recognition using Knowledge Transfer across Learning Processes

This paper attempts to improve the multilingual ASR performance by transferring knowledge across learning processes itself as compared to transferring through final model parameters, by minimizing an objective related to expected gradient path length.

TRILLsson: Distilled Universal Paralinguistic Speech Representations

This work publicly releases a collection of paralinguistic speech models that are small and near state-of-the-art performance, and is based on knowledge distillation, and these models are distilled on public data only.

Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion

A novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models is proposed that achieves better performance on speech recognition tasks than existing algorithms, with fewer trainable parameters, lower computational memory cost and faster training speed.

SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training

It is demonstrated that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST 2 speech translation, by around 1 BLEU compared to single-modality pre-trained models, while retaining close to SotA performance on LibriSpeech and SpeechStew ASR tasks.



Effectiveness of self-supervised pre-training for speech recognition

This work directly fine-tunes the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model, demonstrating that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.
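The CTC loss mentioned in this entry trains a model to emit one symbol per frame, and the frame sequence is mapped to a transcript by collapsing consecutive repeats and dropping a special blank symbol. A minimal sketch of that greedy decoding rule, with toy integer labels and blank = 0 (illustrative only, not taken from any of the papers above):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Map per-frame CTC outputs to a label sequence:
    collapse consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:  # new non-blank symbol
            out.append(t)
        prev = t
    return out

# Frames 0,1,1,0,1,2,2,0 decode to [1, 1, 2]: the blank between
# the two 1s keeps them from being collapsed into one symbol.
```

Note that the blank is what lets CTC represent genuinely repeated output symbols, which is why it cannot simply be dropped before collapsing.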

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being…

Lessons from Building Acoustic Models with a Million Hours of Speech

  • S. Parthasarathi, N. Strom
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The experiments show that extremely large amounts of data are indeed useful; with little hyper-parameter tuning, they obtain relative WER improvements in the 10 to 20% range, with higher gains in noisier conditions.

Specaugment on Large Scale Datasets

  • Daniel S. Park, Yu Zhang, Yonghui Wu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper demonstrates the effectiveness of SpecAugment on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset, and introduces a modification of SpecAugment that adapts the time mask size and/or multiplicity depending on the length of the utterance, which can potentially benefit large scale tasks.
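The length-adaptive time masking this entry describes caps the mask width at a fraction of the utterance length rather than using a fixed size. A toy sketch of that idea on a time-major list-of-frames "spectrogram" (function name and parameter values are illustrative, not the paper's actual implementation):

```python
import random

def adaptive_time_mask(spec, p=0.05, num_masks=2, seed=0):
    """Zero out up to `num_masks` time spans, each at most p * T frames
    wide, so the mask size adapts to the utterance length T."""
    rng = random.Random(seed)
    T = len(spec)
    max_width = max(1, int(p * T))          # adaptive cap on mask width
    out = [list(frame) for frame in spec]   # copy; don't mutate input
    for _ in range(num_masks):
        w = rng.randint(1, max_width)       # mask width
        t0 = rng.randint(0, max(0, T - w))  # mask start
        for t in range(t0, t0 + w):
            out[t] = [0.0] * len(out[t])
    return out
```

With `p=0.05` a 100-frame utterance gets masks of at most 5 frames each, while a 1000-frame utterance gets masks of up to 50 frames, which is the point of making the mask size length-dependent.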

Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition

This work first exploits a large amount of unlabeled audio data via representation learning, where it reconstructs a temporal slice of filterbank features from past and future context frames to train a CTC-based end-to-end ASR system using a smaller amount of labeled audio data.

Toward Domain-Invariant Speech Recognition via Large Scale Training

This work explores the idea of building a single domain-invariant model for varied use-cases by combining large scale training data from multiple application domains, and shows that by using as little as 10 hours of data from a new domain, an adapted domain-invariant model can match the performance of a domain-specific model trained from scratch using 70 times as much data.

End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

This work studies pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions, and reaches a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting.

wav2vec: Unsupervised Pre-training for Speech Recognition

Wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training and outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.

Iterative Pseudo-Labeling for Speech Recognition

This work studies Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves, and demonstrates the effectiveness of IPL by achieving state-of-the-art word-error rate on the Librispeech test sets.
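The IPL recipe in this entry alternates between labeling the unlabeled pool with the current model and retraining on the union of labeled and pseudo-labeled data. A toy sketch of that loop, with a hypothetical one-dimensional nearest-centroid classifier standing in for the acoustic model:

```python
def train_centroids(data):
    """Fit one centroid (mean) per label from (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Assign x to the label with the nearest centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def iterative_pseudo_label(labeled, unlabeled, rounds=3):
    """IPL loop: each round re-labels the unlabeled pool with the
    current model, then retrains on labeled + pseudo-labeled data."""
    model = train_centroids(labeled)
    for _ in range(rounds):
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        model = train_centroids(labeled + pseudo)
    return model
```

The key property IPL exploits is that re-labeling happens every round as the model improves, rather than once with a fixed teacher.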

Self-Training for End-to-End Speech Recognition

  • Jacob Kahn, Ann Lee, Awni Y. Hannun
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
It is demonstrated that training with pseudo-labels can substantially improve the accuracy of a baseline model, and self-training is revisited in the context of end-to-end speech recognition.