Corpus ID: 237941095

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

@article{Zhang2021BigSSLET,
  title={BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition},
  author={Yu Zhang and Daniel S. Park and Wei Han and James Qin and Anmol Gulati and Joel Shor and Aren Jansen and Yuanzhong Xu and Yanping Huang and Shibo Wang and Zongwei Zhou and Bo Li and Min Ma and William Chan and Jiahui Yu and Yongqiang Wang and Liangliang Cao and Khe Chai Sim and Bhuvana Ramabhadran and Tara N. Sainath and Françoise Beaufays and Zhifeng Chen and Quoc V. Le and Chung-Cheng Chiu and Ruoming Pang and Yonghui Wu},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.13226}
}
  • Yu Zhang, Daniel S. Park, +23 authors Yonghui Wu
  • Published 27 September 2021
  • Computer Science, Engineering
  • ArXiv
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pretraining, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion…
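The recipe summarized above, self-supervised pre-training of a large encoder followed by supervised fine-tuning on labeled speech, can be illustrated with a minimal sketch. The encoder class, checkpoint path, and training step below are hypothetical stand-ins, not the authors' code; the paper works with billion-parameter models and a far larger pipeline.

    # A minimal sketch, not the authors' code: initialize an ASR encoder from a
    # self-supervised checkpoint, then fine-tune it with a CTC loss on labeled data.
    # `Encoder` and the checkpoint path are hypothetical stand-ins.
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Small stand-in for a large pre-trained speech encoder."""
        def __init__(self, feat_dim=80, hidden=512, vocab=1024):
            super().__init__()
            self.body = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, vocab + 1)  # +1 for the CTC blank

        def forward(self, feats):
            out, _ = self.body(feats)
            return self.head(out).log_softmax(-1)

    model = Encoder()
    # Load self-supervised weights for the encoder body only; the CTC head is new.
    # model.body.load_state_dict(torch.load("pretrained_encoder.pt"))  # hypothetical file

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def finetune_step(feats, feat_lens, targets, target_lens):
        """One supervised fine-tuning step on labeled speech."""
        logp = model(feats).transpose(0, 1)   # (time, batch, vocab) as CTCLoss expects
        loss = ctc(logp, targets, feat_lens, target_lens)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()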
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of…
SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
  • Ankur Bapna, Yu-an Chung, +7 authors Yu Zhang
  • Computer Science
  • ArXiv
  • 2021
TLDR: It is demonstrated that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST 2 speech translation, by around 1 BLEU compared to single-modality pre-trained models, while retaining close to SotA performance on LibriSpeech and SpeechStew ASR tasks.
Multilingual Speech Recognition using Knowledge Transfer across Learning Processes
TLDR: This paper attempts to improve multilingual ASR performance by transferring knowledge across the learning process itself, rather than through final model parameters, by minimizing an objective related to expected gradient path length.
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
TLDR: This paper aims to improve the existing SSL framework for speaker representation learning and introduces an utterance mixing strategy for data augmentation, where additional overlapped utterances are created in an unsupervised manner and incorporated during training.
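As a rough illustration of the utterance mixing idea above, the hypothetical helper below overlays a secondary waveform onto a random region of a primary one at reduced gain; the paper's actual strategy (how mixing partners are sampled, the gain schedule, and how overlaps interact with the SSL objective) is not reproduced here.

    # A minimal sketch, not the UniSpeech-SAT implementation, of utterance-mixing
    # style augmentation: overlay a random secondary utterance onto a primary one
    # at reduced gain so the model sees overlapped speech during training.
    import numpy as np

    def mix_utterances(primary: np.ndarray, secondary: np.ndarray,
                       gain: float = 0.3, rng=np.random) -> np.ndarray:
        """Overlap `secondary` onto a random region of `primary` (mono waveforms)."""
        mixed = primary.copy()
        seg_len = min(len(secondary), len(primary))
        start = rng.randint(0, len(primary) - seg_len + 1)
        mixed[start:start + seg_len] += gain * secondary[:seg_len]
        return mixed

    # Example with random noise standing in for real waveforms:
    a = np.random.randn(16000).astype(np.float32)   # 1 s at 16 kHz
    b = np.random.randn(8000).astype(np.float32)    # 0.5 s at 16 kHz
    overlapped = mix_utterances(a, b)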
Recent Advances in End-to-End Automatic Speech Recognition
  • Jinyu Li
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR: This paper overviews recent advances in E2E models, focusing on technologies that address the challenges of adopting E2E models in production, from the industry's perspective.
PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
TLDR: This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and fine-tunes subnetworks for much better ASR performance while requiring only a single downstream fine-tuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.
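The TLDR above can be made concrete with a small sketch of the prune-adjust-re-prune loop: magnitude-prune once, fine-tune all weights so pruned ones can recover, and periodically re-prune at the same sparsity. The helpers below are hypothetical and omit PARP's details (which layers are pruned, schedules, and the downstream ASR loss).

    # A minimal, hypothetical sketch of the prune-adjust-re-prune idea (not the PARP code).
    import torch

    def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
        """Binary mask keeping the largest-magnitude (1 - sparsity) fraction of weights."""
        k = int(weight.numel() * sparsity)
        if k == 0:
            return torch.ones_like(weight)
        threshold = weight.abs().flatten().kthvalue(k).values
        return (weight.abs() > threshold).float()

    def parp_finetune(model, loss_fn, loader, sparsity=0.5, reprune_every=100, lr=1e-4):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        prunable = [n for n, p in model.named_parameters() if p.dim() > 1]

        def prune():
            # Zero out the smallest-magnitude weights; they may grow back via gradients.
            for n, p in model.named_parameters():
                if n in prunable:
                    p.data.mul_(magnitude_mask(p.data, sparsity))

        prune()                                    # initial subnetwork
        for step, (x, y) in enumerate(loader):
            loss = loss_fn(model(x), y)            # pruned weights still receive gradients
            opt.zero_grad(); loss.backward(); opt.step()
            if (step + 1) % reprune_every == 0:    # re-prune at the same sparsity
                prune()
        return model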

References

Showing 1-10 of 102 references
Lessons from Building Acoustic Models with a Million Hours of Speech
  • S. Parthasarathi, N. Strom
  • Computer Science, Engineering
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR: The experiments show that extremely large amounts of data are indeed useful; with little hyper-parameter tuning, they obtain relative WER improvements in the 10 to 20% range, with higher gains in noisier conditions.
Specaugment on Large Scale Datasets
  • Daniel S. Park, Y. Zhang, +5 authors Yonghui Wu
  • Engineering, Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR: This paper demonstrates the effectiveness of SpecAugment on tasks with large-scale datasets by investigating its application to the Google Multidomain Dataset, and introduces a modification of SpecAugment that adapts the time mask size and/or multiplicity depending on the length of the utterance, which can potentially benefit large-scale tasks.
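A minimal sketch of the length-adaptive time masking described above: both the maximum mask width and the number of masks grow with the number of frames. The ratios below are illustrative placeholders, not the paper's settings.

    # Hypothetical helper illustrating adaptive SpecAugment time masking.
    import numpy as np

    def adaptive_time_mask(spec: np.ndarray, mask_ratio: float = 0.05,
                           multiplicity_ratio: float = 0.04, max_masks: int = 20,
                           rng=np.random) -> np.ndarray:
        """spec: (time, freq) log-mel features. Returns a copy with time masks applied."""
        T = spec.shape[0]
        max_width = max(1, int(mask_ratio * T))                 # mask size scales with length
        n_masks = min(max_masks, int(multiplicity_ratio * T))   # multiplicity scales too
        out = spec.copy()
        for _ in range(n_masks):
            width = rng.randint(0, max_width + 1)
            start = rng.randint(0, max(1, T - width + 1))
            out[start:start + width, :] = 0.0
        return out

    # Example: a 1000-frame utterance gets more and wider masks than a 100-frame one.
    masked = adaptive_time_mask(np.random.randn(1000, 80))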
Toward Domain-Invariant Speech Recognition via Large Scale Training
TLDR: This work explores the idea of building a single domain-invariant model for varied use cases by combining large-scale training data from multiple application domains, and shows that by using as little as 10 hours of data from a new domain, an adapted domain-invariant model can match the performance of a domain-specific model trained from scratch using 70 times as much data.
Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition
TLDR: This work first exploits a large amount of unlabeled audio data via representation learning, reconstructing a temporal slice of filterbank features from past and future context frames, and then trains a CTC-based end-to-end ASR system using a smaller amount of labeled audio data.
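The reconstruction objective described above can be sketched as follows: hide a temporal slice of the filterbank features and train a bidirectional encoder to predict it from the surrounding frames. The model and loss below are simplified stand-ins, not the paper's architecture.

    # A minimal sketch (not the paper's model) of masked-slice reconstruction on
    # unlabeled filterbank features.
    import torch
    import torch.nn as nn

    feat_dim, hidden = 80, 256
    encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
    reconstruct = nn.Linear(2 * hidden, feat_dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(reconstruct.parameters()), 1e-3)

    def reconstruction_step(feats: torch.Tensor, slice_start: int, slice_len: int):
        """feats: (batch, time, feat_dim) unlabeled filterbank features."""
        masked = feats.clone()
        masked[:, slice_start:slice_start + slice_len, :] = 0.0   # hide the slice
        context, _ = encoder(masked)                              # past + future context
        pred = reconstruct(context[:, slice_start:slice_start + slice_len, :])
        loss = nn.functional.l1_loss(pred, feats[:, slice_start:slice_start + slice_len, :])
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # Example on random features standing in for real unlabeled audio:
    loss = reconstruction_step(torch.randn(4, 200, feat_dim), slice_start=80, slice_len=20)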
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
TLDR: SpeechStew is a speech recognition model trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal; it is demonstrated that SpeechStew learns powerful transfer learning representations.
wav2vec: Unsupervised Pre-training for Speech Recognition
TLDR: Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
Self-Training for End-to-End Speech Recognition
  • Jacob Kahn, Ann Lee, Awni Y. Hannun
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR: Self-training is revisited in the context of end-to-end speech recognition, and it is demonstrated that training with pseudo-labels can substantially improve the accuracy of a baseline model.
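A minimal sketch of the pseudo-labeling loop behind this result, with hypothetical `transcribe` and `train` callables: a teacher trained on labeled data transcribes unlabeled audio, a simple confidence filter keeps the better hypotheses, and a student is trained on the union. The paper's decoding, filtering, and ensembling details are omitted.

    # Hypothetical self-training loop; `transcribe` and `train` are injected callables.
    def self_train(teacher, student, labeled, unlabeled, transcribe, train, confidence=0.9):
        """labeled: list of (audio, text); unlabeled: list of audio; returns the student."""
        pseudo = []
        for audio in unlabeled:
            text, score = transcribe(teacher, audio)   # teacher decodes the utterance
            if score >= confidence:                    # simple confidence filter
                pseudo.append((audio, text))
        # Train the student on the union of real and pseudo-labeled data.
        train(student, labeled + pseudo)
        return student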
Analysis of low-resource acoustic model self-training
TLDR: Although the small 5k-word vocabulary raises WER by 2% absolute, self-training is as effective as with a large 75k-word vocabulary, and adding all 75k words to the decoding vocabulary after self-training reduces the WER degradation to only 0.8% absolute.
Semi-supervised Training for End-to-end Models via Weak Distillation
TLDR: A Part-of-Speech (POS) tagger is adopted to filter the unsupervised data, keeping only utterances with proper nouns; it is shown that training with the filtered unsupervised data provides up to a 13% relative reduction in word error rate (WER), and, when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
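As a rough illustration of the filtering step described above (not the paper's pipeline), the sketch below keeps only machine-transcribed utterances whose hypothesis contains a proper noun, using NLTK's off-the-shelf POS tagger.

    # Hypothetical proper-noun filter for pseudo-labeled transcripts.
    import nltk

    # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # one-time setup

    def has_proper_noun(transcript: str) -> bool:
        tags = nltk.pos_tag(nltk.word_tokenize(transcript))
        return any(tag in ("NNP", "NNPS") for _, tag in tags)

    def filter_pseudo_labels(hypotheses):
        """hypotheses: list of (audio_id, transcript); keep those with proper nouns."""
        return [(aid, text) for aid, text in hypotheses if has_proper_noun(text)]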
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the…