Multi-Task Self-Supervised Learning for Robust Speech Recognition

@article{Ravanelli2020MultiTaskSL,
  title={Multi-Task Self-Supervised Learning for Robust Speech Recognition},
  author={Mirco Ravanelli and Jianyuan Zhong and Santiago Pascual and Pawel Swietojanski and Jo{\~a}o Monteiro and Jan Trmal and Yoshua Bengio},
  journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={6989-6993}
}
Published 25 January 2020
Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), which combines a convolutional encoder with multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including…
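The encoder-plus-workers design lends itself to a compact multi-task training loop. The following PyTorch sketch illustrates the pattern only; the layer sizes, the particular workers (MFCC and log-energy regression), and all tensor shapes are illustrative assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn

# Minimal sketch of a PASE-style encoder with self-supervised workers.
# Layer sizes and worker targets are illustrative, not the paper's setup.

class ConvEncoder(nn.Module):
    def __init__(self, emb_dim=256):
        super().__init__()
        # Strided 1-D convolutions map raw waveform to frame-level embeddings.
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(128, emb_dim, kernel_size=4, stride=2), nn.ReLU(),
        )

    def forward(self, wav):              # wav: (batch, 1, samples)
        return self.net(wav)             # -> (batch, emb_dim, frames)

class Worker(nn.Module):
    # A small head that regresses one self-supervised target from the embedding.
    def __init__(self, emb_dim, target_dim):
        super().__init__()
        self.head = nn.Conv1d(emb_dim, target_dim, kernel_size=1)

    def forward(self, feats):
        return self.head(feats)

encoder = ConvEncoder()
workers = nn.ModuleDict({          # hypothetical worker tasks
    "mfcc": Worker(256, 20),       # regress MFCCs
    "log_energy": Worker(256, 1),  # regress frame log-energy
})
opt = torch.optim.Adam(list(encoder.parameters()) + list(workers.parameters()))

def train_step(wav, targets):
    # targets: dict mapping worker name -> (batch, dim, frames) tensor,
    # time-aligned with the encoder output.
    feats = encoder(wav)
    # Sum of all worker losses: every task back-propagates into the encoder.
    loss = sum(nn.functional.l1_loss(workers[name](feats), targets[name])
               for name in workers)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

Summing the per-worker losses means every self-supervised task back-propagates into the shared encoder, which is what encourages the learned features to be problem-agnostic.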

Citations

Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR
TLDR
A joint pre-training approach for the SE module and the self-supervised model is proposed, together with a dual-attention fusion method that fuses the features of noisy and enhanced speech to compensate for the information loss caused by using the individual modules separately.
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
TLDR
This paper aims to improve the existing SSL framework for speaker representation learning, and introduces an utterance mixing strategy for data augmentation, where additional overlapped utterances are created in an unsupervised manner and incorporated during training.
Self-Supervised Learning based Monaural Speech Enhancement with Multi-Task Pre-Training
TLDR
A multi-task pre-training method is proposed to improve speech enhancement performance with self-supervised learning, and it is demonstrated that the proposed method outperforms state-of-the-art approaches.
Self-Supervised Speech Representation Learning: A Review
TLDR
This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
TLDR
A new pre-trained model, WavLM, to solve full-stack downstream speech tasks, which achieves state-of-the-art performance on the SUPERB benchmark, and brings improvements for various speech processing tasks on their representative benchmarks.
Joint Encoder-Decoder Self-Supervised Pre-training for ASR
TLDR
It is hypothesized that the presence of a decoder in the SSL model helps it learn an acoustic unit-based language model, which might improve the performance of a downstream ASR task.
Self-Supervised Learning for speech recognition with Intermediate layer supervision
TLDR
Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL) forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers, which helps explain the method's success for ASR.
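Mechanically, the intermediate-layer supervision described in the entry above amounts to extra loss terms computed from hidden states. A rough sketch under assumed interfaces (a list of per-layer outputs, a generic SSL loss function, and placeholder layer indices):

def intermediate_layer_ssl_loss(layer_outputs, targets, ssl_loss_fn,
                                supervised_layers=(4, 8)):
    # Sketch of ILS-style training: the SSL loss is applied to the final
    # layer as usual, plus selected intermediate layers. `layer_outputs`
    # is a list of hidden states (one per layer); the indices are illustrative.
    total = ssl_loss_fn(layer_outputs[-1], targets)   # standard top-layer loss
    for idx in supervised_layers:
        # Extra supervision pushes content information into earlier layers.
        total = total + ssl_loss_fn(layer_outputs[idx], targets)
    return total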
LiRA: Learning Visual Speech Representations from Audio through Self-supervision
TLDR
This work trains a ResNet+Conformer model to predict acoustic features from unlabelled visual speech and finds that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments.
Towards Unsupervised Learning of Speech Features in the Wild
TLDR
It is shown that, of the three problems considered (the presence of non-speech data, noisy or low-quality speech data, and an imbalance in speaker distribution), the first two combined can already incur a performance cost of up to 30% relative on the ABX score for the Libri-light train set.
A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition
TLDR
Experimental results reveal that the proposed enhanced wav2vec2.0 model not only improves ASR performance on the noisy test set, surpassing the original model, but also incurs only a tiny performance decrease on the clean test set.

References

Showing 1-10 of 43 references
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
TLDR
Experiments show that the proposed self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
Contaminated speech training methods for robust DNN-HMM distant speech recognition
TLDR
This paper revisits this classical approach in the context of modern DNN-HMM systems and proposes three methods, namely asymmetric context windowing, close-talk based supervision, and close-talk based pre-training, showing a significant advantage when all three are adopted.
Unsupervised Speech Representation Learning Using WaveNet Autoencoders
TLDR
A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance, with reported performance comparable to the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
Learning Speaker Representations with Mutual Information
TLDR
This work learns representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence.
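One common way to realize the objective summarized above is a contrastive, InfoNCE-style lower bound on mutual information: two chunks drawn from the same utterance form a positive pair, and chunks from other utterances in the batch serve as negatives. The sketch below follows that recipe; the encoder interface, chunk length, and temperature are assumptions, and the cited paper uses a discriminator-based estimator rather than exactly this loss.

import torch
import torch.nn.functional as F

def speaker_mi_loss(encoder, utterances, chunk_len=16000, temperature=0.07):
    # InfoNCE-style sketch: maximize agreement between two chunks of the
    # same utterance against chunks from other utterances in the batch.
    # `encoder` maps (batch, samples) -> (batch, dim); shapes are assumptions,
    # and each utterance is assumed to be at least `chunk_len` samples long.
    batch, total = utterances.shape
    # Sample two random chunk start positions per utterance.
    starts = torch.randint(0, total - chunk_len + 1, (2, batch))
    a = torch.stack([utterances[i, s:s + chunk_len]
                     for i, s in enumerate(starts[0].tolist())])
    b = torch.stack([utterances[i, s:s + chunk_len]
                     for i, s in enumerate(starts[1].tolist())])
    za = F.normalize(encoder(a), dim=-1)    # (batch, dim)
    zb = F.normalize(encoder(b), dim=-1)
    logits = za @ zb.t() / temperature      # pairwise similarities
    labels = torch.arange(batch)            # diagonal = positive pairs
    return F.cross_entropy(logits, labels)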
Improving Speech Recognition by Revising Gated Recurrent Units
TLDR
This work proposes to remove the reset gate in the GRU design, resulting in a more efficient single-gate architecture, and to replace tanh with ReLU activations in the state update equations; the revised architecture consistently improves recognition performance across different tasks, input features, and noisy conditions compared to a standard GRU.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients), achieving state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks and outperforming all prior work.
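Since SpecAugment operates directly on the feature matrix, a bare-bones variant fits in a few lines. The sketch below applies a single frequency mask and a single time mask to a log-mel feature matrix; the mask widths are illustrative defaults, and time warping (the paper's third policy) is omitted.

import torch

def spec_augment(feats, max_f=15, max_t=40):
    # feats: (frames, mel_bins) log-mel features. Applies one frequency mask
    # and one time mask; assumes frames > max_t and mel_bins > max_f.
    frames, bins = feats.shape
    out = feats.clone()
    # Frequency mask: zero a random band of consecutive mel channels.
    f = torch.randint(0, max_f + 1, (1,)).item()
    f0 = torch.randint(0, bins - f + 1, (1,)).item()
    out[:, f0:f0 + f] = 0.0
    # Time mask: zero a random span of consecutive frames.
    t = torch.randint(0, max_t + 1, (1,)).item()
    t0 = torch.randint(0, frames - t + 1, (1,)).item()
    out[t0:t0 + t, :] = 0.0
    return out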
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
TLDR
The 5th CHiME Challenge is introduced, which considers the task of distant multi-microphone conversational ASR in real home environments and describes the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.
Unsupervised Learning of Semantic Audio Representations
TLDR
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
The Pytorch-kaldi Speech Recognition Toolkit
TLDR
Experiments, that are conducted on several datasets and tasks, show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.
The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments
TLDR
A first set of baseline results, obtained using different techniques including Deep Neural Networks (DNNs) and aligned with the international state of the art, is reported.