Corpus ID: 239885872

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

  title={WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing},
  author={Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Michael Zeng and Furu Wei},
Self-supervised learning (SSL) has achieved great success in speech recognition, but limited exploration has been attempted for other speech processing tasks. As the speech signal contains multifaceted information, including speaker identity, paralinguistics, and spoken content, learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM extends the HuBERT framework to denoising masked… 
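The denoising masked prediction that the abstract alludes to combines two ingredients: simulating noisy or overlapped speech by mixing in a secondary utterance, and masking spans of the input that the model must then predict. A minimal NumPy sketch of those two steps follows; the signal lengths, SNR, and mask sizes are illustrative assumptions, not WavLM's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_utterances(main, other, snr_db=5.0):
    """Overlay a secondary utterance at a given SNR to simulate the
    noisy/overlapped inputs used in denoising pre-training."""
    scale = np.sqrt(np.mean(main**2) / (np.mean(other**2) * 10**(snr_db / 10)))
    mixed = main.copy()
    n = min(len(main), len(other))
    mixed[:n] += scale * other[:n]
    return mixed

def span_mask(num_frames, span=10, ratio=0.5):
    """Pick random start frames and mask contiguous spans; the model is
    trained to predict pseudo-labels only at the masked positions."""
    mask = np.zeros(num_frames, dtype=bool)
    num_starts = int(num_frames * ratio / span)
    starts = rng.choice(num_frames - span, size=num_starts, replace=False)
    for s in starts:
        mask[s:s + span] = True
    return mask

main = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz (synthetic)
other = rng.standard_normal(16000)  # interfering utterance (synthetic)
noisy = mix_utterances(main, other)
mask = span_mask(100)               # mask over 100 feature frames
```

In the real model the masked, noisy input is encoded by a Transformer and the loss is computed against discrete pseudo-labels of the clean speech; the sketch only shows the input construction.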


Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision
This work proposes Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers, and explains the method's success for ASR.
Robust Self-Supervised Audio-Visual Speech Recognition
This work presents a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model, which outperforms prior state-of-the-art AVSR systems on LRS3, the largest available AVSR benchmark dataset.


UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
This paper aims to improve the existing SSL framework for speaker representation learning, and introduces an utterance-mixing strategy for data augmentation, where additional overlapped utterances are created in an unsupervised manner and incorporated during training.
Generative Pre-Training for Speech with Autoregressive Predictive Coding
  • Yu-An Chung, James R. Glass
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper proposes to use autoregressive predictive coding (APC), a recently proposed self-supervised objective, as a generative pre-training approach for learning meaningful, non-specific, and transferable speech representations.
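The APC objective pairs each frame with the frame a fixed number of steps ahead and regresses the future frame with an L1 loss. A minimal NumPy sketch of the target construction and loss (the shift of 3 and the 40-dimensional features are illustrative assumptions, and no actual predictor network is included):

```python
import numpy as np

def apc_targets(features, shift=3):
    """APC predicts the frame `shift` steps ahead of the current frame:
    inputs x[0:T-shift] are paired with targets x[shift:T]."""
    return features[:-shift], features[shift:]

def apc_l1_loss(pred, target):
    """APC is trained with an L1 regression loss on future frames."""
    return np.abs(pred - target).mean()

T, D = 100, 40  # frames x feature dimensions (synthetic)
feats = np.random.default_rng(1).standard_normal((T, D))
inputs, targets = apc_targets(feats, shift=3)
# Loss for a trivial "copy the input" predictor, as a baseline:
loss = apc_l1_loss(inputs, targets)
```

In the actual method, an autoregressive model (e.g. an RNN or Transformer decoder) maps the input frames to predictions, and its hidden states serve as the transferable speech representation.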
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
A self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration, is introduced, and the proposed method is shown to be transferable to downstream datasets not used in pre-training.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
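The frequency- and time-masking parts of SpecAugment can be sketched directly on a (time x frequency) log-mel spectrogram; the mask counts and maximum widths below are illustrative assumptions, and the time-warping component of the original method is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, num_freq_masks=2, F=8, num_time_masks=2, T=20):
    """Zero out random frequency bands (width up to F bins) and random
    time spans (length up to T frames) of a (time x freq) spectrogram."""
    out = spec.copy()
    t_len, f_len = out.shape
    for _ in range(num_freq_masks):
        f = rng.integers(0, F + 1)          # mask width in mel bins
        f0 = rng.integers(0, f_len - f + 1)  # mask start bin
        out[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = rng.integers(0, T + 1)          # mask length in frames
        t0 = rng.integers(0, t_len - t + 1)  # mask start frame
        out[t0:t0 + t, :] = 0.0
    return out

mel = rng.standard_normal((300, 80))  # 300 frames x 80 mel bins (synthetic)
aug = spec_augment(mel)
```

Because the augmentation operates on features rather than raw audio, it adds negligible cost and is applied on the fly during training.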
Investigation of Practical Aspects of Single Channel Speech Separation for ASR
This paper investigates a two-stage training scheme that first applies a feature-level optimization criterion for pretraining, followed by an ASR-oriented optimization criterion using an end-to-end (E2E) speech recognition model, and introduces a modified teacher-student learning technique for model compression.
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
  • Yu Zhang, Daniel S. Park, +23 authors Yonghui Wu
  • Computer Science, Engineering
  • 2021
It is found that the combination of pretraining, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data.
wav2vec: Unsupervised Pre-training for Speech Recognition
Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry on relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
Semantic Mask for Transformer based End-to-End Speech Recognition
This paper proposes a semantic mask based regularization for training such kind of end-to-end (E2E) model, which is to mask the input features corresponding to a particular output token in order to encourage the model to fill the token based on the contextual information.
Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition
This work first exploits a large amount of unlabeled audio data via representation learning, reconstructing a temporal slice of filterbank features from past and future context frames, and then trains a CTC-based end-to-end ASR system using a smaller amount of labeled audio data.