Corpus ID: 239024487

Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning

Yi-Chen Chen, Shu-wen Yang, Cheng-Kuang Lee, S. See, Hung-yi Lee
Speech representation learning plays a vital role in speech processing. Among representation learning paradigms, self-supervised learning (SSL) has become an important research direction: an SSL-pretrained model has been shown to achieve excellent performance on various downstream speech processing tasks. Supervised multi-task learning (MTL), on the other hand, is another representation learning paradigm, which has proven effective in computer vision (CV) and natural language processing (NLP). However…
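The abstract's central idea, a shared representation trained against several supervised speech tasks at once, can be illustrated with a minimal NumPy sketch. All dimensions, task names, and the unweighted sum of losses below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared "encoder": one linear map from 40-dim frames to 16-dim features.
W_shared = rng.normal(scale=0.1, size=(40, 16))

# One lightweight head per downstream task (sizes are made up for the sketch).
heads = {
    "phoneme": rng.normal(scale=0.1, size=(16, 30)),   # 30 phone classes
    "speaker": rng.normal(scale=0.1, size=(16, 10)),   # 10 speakers
}

def softmax_xent(logits, target):
    """Cross-entropy of one frame's logits against an integer target."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

frame = rng.normal(size=40)
feat = frame @ W_shared  # the same shared representation feeds every task

# Multi-task objective: a (here unweighted) sum of per-task losses; gradients
# through this sum would update both the heads and the shared encoder.
total_loss = sum(
    softmax_xent(feat @ head, target)
    for head, target in [(heads["phoneme"], 3), (heads["speaker"], 1)]
)
```

The point of the sketch is only the shape of the objective: one backbone, several task heads, one combined loss.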


Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

This paper exploits and analyzes a series of wav2vec pre-trained models for speech recognition in 15 low-resource languages in the OpenASR21 Challenge, and investigates data utilization, multilingual learning, and the use of a phoneme-level recognition task in fine-tuning.

Feature Learning and Ensemble Pre-Tasks Based Self-Supervised Speech Denoising and Dereverberation

The latent representation learning and the mask estimation are treated as two pre-tasks in the training stage; the NOISEX and DAPS corpora are used to evaluate the efficacy of the proposed method, which outperforms the state-of-the-art methods.

Parameter Efficient Transfer Learning for Various Speech Processing Tasks

This work proposes a new adapter architecture to acquire feature representations more flexibly for various speech tasks; in experiments, the adapter performed on par with or better than naïve fine-tuning while using only 11% of the learnable parameters.

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better ASR performance while requiring only a single downstream finetuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.

Self-Supervised Learning based Monaural Speech Enhancement with Complex-Cycle-Consistent

Both ablation and comparison experimental results show that the proposed self-supervised learning based monaural speech enhancement method clearly outperforms the state-of-the-art approaches.

Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing

S³-Router can serve as an all-in-one technique to enable a new finetuning scheme, an efficient multilingual/multitask solution, a state-of-the-art ASR pruning technique, and a new tool to quantitatively analyze the learned speech representation.

Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR

This work proposes a pre-training approach called wav2vec-S, which uses task-specific semi-supervised pre-training to re-train the self-supervised pre-trained model for the ASR task, thus more effectively utilizing the capacity of the pre-trained model to generate task-specific representations for ASR.



Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.

Multi-Task Self-Supervised Learning for Robust Speech Recognition

PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.

SUPERB: Speech processing Universal PERformance Benchmark

A simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model, chosen for its favorable re-usability; results demonstrate that the framework is promising, as SSL representations show competitive generalizability and accessibility across SUPERB tasks.
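The SUPERB recipe, freeze the upstream model and train only a lightweight head per task, can be sketched with frozen features standing in for SSL representations. The random features, labels, and single logistic-regression head below are toy assumptions, not the benchmark's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen SSL features: 100 utterances x 32-dim representations.
# In SUPERB these would come from a frozen pretrained model; they never update.
X = rng.normal(size=(100, 32))
y = (X[:, 0] > 0).astype(int)  # toy binary downstream labels

# Lightweight prediction head: a single logistic-regression layer, the only
# part with trainable parameters.
w = np.zeros(32)
b = 0.0
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    grad_w = X.T @ (p - y) / len(y)         # mean logistic-loss gradient
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

p_final = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = ((p_final > 0.5) == y).mean()
```

Because only `w` and `b` are trained, one frozen backbone can serve many such heads, which is the re-usability the blurb refers to.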

TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

A self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration, is introduced, and it is shown the proposed method is transferable to downstream datasets not used in pre-training.

wav2vec: Unsupervised Pre-training for Speech Recognition

Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.

SpeechNet: A Universal Modularized Model for Speech Processing Tasks

A universal modularized model, SpeechNet, which contains five basic modules for speech processing; the code and experimental settings will be released to facilitate research on modularized universal models and multi-task learning of speech processing tasks.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

Experiments on ATIS show that the SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even though environmental noise is present and a limited amount of labeled semantics data is available for training.

An Unsupervised Autoregressive Model for Speech Representation Learning

Speech representations learned by the proposed unsupervised autoregressive neural model significantly improve performance on both phone classification and speaker verification over the surface features and other supervised and unsupervised approaches.

HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training?

The Hidden-Unit BERT (HuBERT) model is proposed, which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model and allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality.
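The "cheap k-means clustering step" that produces HuBERT's discrete targets can be illustrated with a NumPy-only sketch. The feature dimensions, cluster count, and function name are illustrative; HuBERT clusters MFCCs (and later its own hidden states) rather than random frames:

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=20, seed=0):
    """Cluster frame-level features with k-means and return one integer
    pseudo-label per frame, the HuBERT-style discrete prediction targets."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k randomly chosen frames.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(
            features[:, None, :] - centroids[None, :, :], axis=-1
        )
        labels = dists.argmin(axis=1)
        # Update each centroid to its cluster mean; keep it if the cluster
        # went empty.
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels

# Toy "MFCC-like" input: 200 frames x 13 coefficients.
frames = np.random.default_rng(1).normal(size=(200, 13))
labels = kmeans_pseudo_labels(frames, k=8)
```

The resulting `labels` play the role of the "bad teacher": noisy but consistent targets for the masked-prediction pre-training objective.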