Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

@inproceedings{Pascual2019LearningPS,
  title={Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks},
  author={Santiago Pascual and Mirco Ravanelli and Joan Serr{\`a} and Antonio Bonafonte and Yoshua Bengio},
  booktitle={INTERSPEECH},
  year={2019}
}
Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. […] The consensus needed across the different tasks naturally imposes meaningful constraints on the encoder, contributing to the discovery of general representations and minimizing the risk of learning superficial ones. Experiments show that the proposed approach can learn…
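The abstract's core idea can be sketched as a shared encoder whose output must satisfy several self-supervised "workers" at once, each contributing its own loss to a single training objective. The sketch below is purely illustrative (the function names and toy losses are assumptions, not the authors' implementation); in the actual paper the encoder and workers are neural networks trained by backpropagation.

```python
# Minimal sketch of the multi-task self-supervised pattern:
# one shared encoder, several task-specific workers, summed losses.
# All names and losses here are illustrative stand-ins.

def encoder(frame):
    # Stand-in for the shared speech encoder: maps a raw input frame
    # to a low-dimensional representation.
    return [x * 0.5 for x in frame]

def waveform_worker(z):
    # Toy self-supervised loss, e.g. regressing the waveform back
    # from the representation (here: an L1-style penalty).
    return sum(abs(v) for v in z)

def prosody_worker(z):
    # Another toy self-supervised loss, e.g. predicting prosodic
    # features such as log-energy (here: an L2-style penalty).
    return sum(v * v for v in z)

WORKERS = [waveform_worker, prosody_worker]

def total_loss(frame):
    # The encoder is shared, so every worker's loss constrains the same
    # representation: this is the "consensus" the abstract refers to.
    z = encoder(frame)
    return sum(worker(z) for worker in WORKERS)
```

Because every worker reads the same encoding, a representation that only satisfies one superficial cue cannot drive all losses down at once, which is the regularizing effect the abstract describes.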


Multi-Task Self-Supervised Learning for Robust Speech Recognition
TLDR
PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
Self-Supervised Speech Representation Learning: A Review
TLDR
This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
Multi-Task Self-Supervised Pre-Training for Music Classification
TLDR
This paper applies self-supervised and multi-task learning methods to pre-training music encoders, and explores various design choices, including encoder architectures, weighting mechanisms for combining losses from multiple tasks, and the selection of pretext-task workers, to investigate how these choices interact with various downstream music classification tasks.
Improving Self-Supervised Speech Representations by Disentangling Speakers
TLDR
This paper proposes a new SSL method that can achieve speaker disentanglement without severe loss of content, and incorporates disentangling mechanisms to regularize both the teachers and the students (learned representations).
Does Visual Self-Supervision Improve Learning of Speech Representations?
TLDR
The results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative speech representations.
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
TLDR
A self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration, is introduced, and it is shown that the proposed method is transferable to downstream datasets not used in pre-training.
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
TLDR
The Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning is presented, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
Self-Supervised Learning for speech recognition with Intermediate layer supervision
TLDR
Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL) forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers, which explains the method's success for ASR.
SUPERB: Speech Understanding and PERformance Benchmark
TLDR
The speech processing community lacks a setup that systematically measures the quality of learned representations across a wide range of downstream speech applications; SUPERB is a leaderboard that benchmarks the performance of learned speech representations on ten speech processing tasks.
Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning
TLDR
It is shown that MTL finetuning can further improve SSL pretraining, and the generalizability of supervised MTL finetuning is analyzed to examine whether the speech representation learned by MTL finetuning can generalize to unseen new tasks.

References

Showing 1-10 of 49 references
Learning Speaker Representations with Mutual Information
TLDR
This work learns representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence.
Neural Discrete Representation Learning
TLDR
Pairing these representations with an autoregressive prior, the model can generate high-quality images, videos, and speech, as well as performing high-quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
Unsupervised Speech Representation Learning Using WaveNet Autoencoders
TLDR
A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance; performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task is reported.
Deep Learning of Representations for Unsupervised and Transfer Learning
  • Yoshua Bengio
  • Computer Science
    ICML Unsupervised and Transfer Learning
  • 2012
TLDR
This paper discusses why unsupervised pre-training of representations can be useful, and how it can be exploited in the transfer-learning scenario, where the authors care about predictions on examples that are not from the same distribution as the training distribution.
Multi-task Self-Supervised Visual Learning
TLDR
The results show that deeper networks work better, and that combining tasks, even via a naive multi-head architecture, always improves performance.
Unsupervised Learning of Semantic Audio Representations
  • A. Jansen, M. Plakal, R. Saurous
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Light Gated Recurrent Units for Speech Recognition
TLDR
This paper revises one of the most popular RNN models, namely gated recurrent units (GRUs), proposing a simplified architecture that turned out to be very effective for ASR, and replacing the hyperbolic tangent with rectified linear unit activations.
Representation Learning with Contrastive Predictive Coding
TLDR
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
Speaker Recognition from Raw Waveform with SincNet
TLDR
This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.
SEGAN: Speech Enhancement Generative Adversarial Network
TLDR
This work proposes the use of generative adversarial networks for speech enhancement, operates at the waveform level, training the model end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.