Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
@inproceedings{Pascual2019LearningPS,
  title     = {Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks},
  author    = {Santiago Pascual and Mirco Ravanelli and Joan Serr{\`a} and Antonio Bonafonte and Yoshua Bengio},
  booktitle = {INTERSPEECH},
  year      = {2019}
}
Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. […] The needed consensus across different tasks naturally imposes meaningful constraints on the encoder, contributing to discovering general representations and to minimizing the risk of learning superficial ones. Experiments show that the proposed approach can learn…
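To make the multi-worker idea above concrete, the sketch below shows a shared encoder whose output feeds several small self-supervised workers, with their losses summed so the encoder must satisfy all tasks at once. This is an illustrative PyTorch sketch: the layer sizes and the two regression workers are placeholders, not the paper's exact configuration.

```python
# Minimal sketch of a multi-task self-supervised setup: one shared encoder feeds
# several small "workers", and their losses are summed so the encoder must satisfy
# all tasks at once. Shapes and worker choices are illustrative placeholders.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, dim=100):
        super().__init__()
        # Strided 1-D convolutions over the raw waveform -> frame-level embeddings
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=21, stride=10, padding=10), nn.ReLU(),
            nn.Conv1d(64, dim, kernel_size=11, stride=8, padding=5), nn.ReLU(),
        )

    def forward(self, wav):              # wav: (batch, 1, samples)
        return self.net(wav)             # (batch, dim, frames)

class RegressionWorker(nn.Module):
    """Tiny head that regresses a per-frame target (e.g. MFCCs) from shared features."""
    def __init__(self, dim, target_dim):
        super().__init__()
        self.head = nn.Conv1d(dim, target_dim, kernel_size=1)

    def forward(self, feats, target):
        return nn.functional.mse_loss(self.head(feats), target)

encoder = Encoder()
workers = nn.ModuleList([RegressionWorker(100, 20),   # e.g. an MFCC-like worker
                         RegressionWorker(100, 1)])   # e.g. a frame-energy worker
opt = torch.optim.Adam(list(encoder.parameters()) + list(workers.parameters()), lr=1e-4)

wav = torch.randn(8, 1, 16000)                        # fake 1-second batch
feats = encoder(wav)
targets = [torch.randn(8, 20, feats.shape[-1]),       # fake per-frame targets
           torch.randn(8, 1, feats.shape[-1])]
loss = sum(w(feats, t) for w, t in zip(workers, targets))
opt.zero_grad(); loss.backward(); opt.step()
```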
146 Citations
Multi-Task Self-Supervised Learning for Robust Speech Recognition
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
Self-Supervised Speech Representation Learning: A Review
- Computer Science · ArXiv
- 2022
This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
Multi-Task Self-Supervised Pre-Training for Music Classification
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This paper applies self-supervised and multi-task learning methods for pre-training music encoders, and explores various design choices, including encoder architectures, weighting mechanisms for combining losses from multiple tasks, and the selection of pretext-task workers, to investigate how these choices interact with various downstream music classification tasks.
Improving Self-Supervised Speech Representations by Disentangling Speakers
- Computer Science · ArXiv
- 2022
This paper proposes a new SSL method that can achieve speaker disentanglement without severe loss of content, and incorporates disentangling mechanisms to regularize both the teachers and the students (learned representations).
Does Visual Self-Supervision Improve Learning of Speech Representations?
- Computer Science · ArXiv
- 2020
The results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative speech representations.
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
A self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration, is introduced, and it is shown that the proposed method is transferable to downstream datasets not used in pre-training.
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
The Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
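The masked-prediction idea in HuBERT can be illustrated with a short sketch: frames are replaced by a learned mask embedding, and a Transformer encoder is trained to predict precomputed cluster IDs at the masked positions. The random cluster_ids below are a stand-in for HuBERT's offline k-means targets, and all sizes are illustrative.

```python
# Sketch of HuBERT-style masked prediction: hide a span of input frames and train
# a Transformer encoder to predict precomputed cluster IDs at the masked positions.
# The random "cluster_ids" stand in for HuBERT's offline k-means targets.
import torch
import torch.nn as nn

batch, n_frames, dim, n_clusters = 4, 100, 256, 50
features = torch.randn(batch, n_frames, dim)                    # frame-level features
cluster_ids = torch.randint(0, n_clusters, (batch, n_frames))   # stand-in targets

mask = torch.zeros(batch, n_frames, dtype=torch.bool)
mask[:, 20:40] = True                                           # mask one span per utterance

mask_emb = nn.Parameter(torch.zeros(dim))                       # learned [MASK] embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
proj = nn.Linear(dim, n_clusters)

x = features.clone()
x[mask] = mask_emb                                              # replace masked frames
logits = proj(encoder(x))                                       # (batch, frames, clusters)

# BERT-like loss: only masked positions contribute
loss = nn.functional.cross_entropy(logits[mask], cluster_ids[mask])
loss.backward()
```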
Self-Supervised Learning for speech recognition with Intermediate layer supervision
- Computer Science · ICASSP
- 2022
Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL) forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers, and an analysis explains the success of the method for ASR.
SUPERB: Speech Understanding and PERformance Benchmark
- Computer Science
- 2021
The speech processing community lacks a setup that systematically measures the quality of learned representations across a wide range of downstream speech applications, so SUPERB is introduced as a leaderboard to benchmark the performance of learned speech representations on ten speech processing tasks.
Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning
- Computer Science · ArXiv
- 2021
It is shown that MTL finetuning can further improve SSL pretraining, and the generalizability of supervised MTL finetuning is analyzed to examine whether the speech representations learned by MTL finetuning can generalize to unseen tasks.
References
SHOWING 1-10 OF 49 REFERENCES
Learning Speaker Representations with Mutual Information
- Computer Science · INTERSPEECH
- 2019
This work learns representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence.
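One way to read the mutual-information objective above is as a discriminator trained to score pairs of embeddings drawn from the same sentence higher than pairs from different sentences, which maximizes a lower bound on mutual information. The sketch below uses random embeddings as stand-ins and a BCE-based bound, which may differ from the exact estimator used in the paper.

```python
# Sketch of a discriminator-based mutual-information lower bound: positive pairs
# (chunks from the same sentence) should score high, negative pairs low. Embeddings
# here are random stand-ins; the paper's exact estimator may differ.
import torch
import torch.nn as nn

emb_dim = 128
discriminator = nn.Sequential(nn.Linear(2 * emb_dim, 256), nn.ReLU(), nn.Linear(256, 1))

anchor = torch.randn(16, emb_dim)                  # chunk 1 of a sentence
positive = anchor + 0.1 * torch.randn(16, emb_dim) # chunk 2 of the same sentence (stand-in)
negative = torch.randn(16, emb_dim)                # chunk from a different sentence

pos_scores = discriminator(torch.cat([anchor, positive], dim=1))
neg_scores = discriminator(torch.cat([anchor, negative], dim=1))

loss = (nn.functional.binary_cross_entropy_with_logits(pos_scores, torch.ones_like(pos_scores))
        + nn.functional.binary_cross_entropy_with_logits(neg_scores, torch.zeros_like(neg_scores)))
loss.backward()
```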
Neural Discrete Representation Learning
- Computer Science · NIPS
- 2017
Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
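The discrete bottleneck at the heart of this model (VQ-VAE) snaps each continuous encoder output to its nearest codebook vector and uses a straight-through estimator so gradients still reach the encoder. The sketch below shows that quantization step in isolation; codebook size and dimensions are illustrative.

```python
# Sketch of the VQ-VAE discrete bottleneck: each encoder output is snapped to its
# nearest codebook vector, and a straight-through estimator lets gradients flow
# back to the encoder. Sizes are illustrative.
import torch

def vector_quantize(z_e, codebook, beta=0.25):
    """z_e: (batch, dim) continuous encoder outputs; codebook: (K, dim)."""
    dists = torch.cdist(z_e, codebook)            # (batch, K) Euclidean distances
    idx = dists.argmin(dim=1)                     # discrete codes
    z_q = codebook[idx]                           # quantized vectors

    # Codebook loss pulls codes toward encoder outputs; commitment loss does the reverse
    loss = ((z_q - z_e.detach()) ** 2).mean() + beta * ((z_e - z_q.detach()) ** 2).mean()

    # Straight-through estimator: forward pass uses z_q, backward copies gradients to z_e
    z_q = z_e + (z_q - z_e).detach()
    return z_q, idx, loss

codebook = torch.randn(512, 64, requires_grad=True)
z_e = torch.randn(8, 64, requires_grad=True)
z_q, codes, vq_loss = vector_quantize(z_e, codebook)
print(codes.shape, vq_loss.item())
```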
Unsupervised Speech Representation Learning Using WaveNet Autoencoders
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2019
A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance, and performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task is reported.
Deep Learning of Representations for Unsupervised and Transfer Learning
- Computer Science · ICML Unsupervised and Transfer Learning
- 2012
This work discusses why unsupervised pre-training of representations can be useful and how it can be exploited in the transfer learning scenario, where predictions are made on examples that are not from the same distribution as the training distribution.
Multi-task Self-Supervised Visual Learning
- Computer Science · 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
The results show that deeper networks work better, and that combining tasks, even via a naive multi-head architecture, always improves performance.
Unsupervised Learning of Semantic Audio Representations
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Light Gated Recurrent Units for Speech Recognition
- Computer Science · IEEE Transactions on Emerging Topics in Computational Intelligence
- 2018
This paper revises one of the most popular RNN models, namely gated recurrent units (GRUs), and proposes a simplified architecture that turns out to be very effective for ASR, replacing the hyperbolic tangent with rectified linear unit activations.
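A minimal cell in the spirit of the light GRU described above is sketched below: the reset gate is removed and the tanh candidate activation is replaced with ReLU. The batch normalization the paper applies to the input projections is omitted here for brevity.

```python
# Sketch of a light GRU cell: no reset gate, ReLU candidate state instead of tanh.
# Batch normalization on the input projections (used in the paper) is omitted.
import torch
import torch.nn as nn

class LightGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.wz = nn.Linear(input_size, hidden_size)
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.wh = nn.Linear(input_size, hidden_size)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, h):
        z = torch.sigmoid(self.wz(x) + self.uz(h))     # update gate (no reset gate)
        h_cand = torch.relu(self.wh(x) + self.uh(h))   # ReLU candidate state
        return z * h + (1.0 - z) * h_cand

cell = LightGRUCell(input_size=40, hidden_size=128)
h = torch.zeros(8, 128)
for x_t in torch.randn(50, 8, 40):                     # 50 time steps, batch of 8
    h = cell(x_t, h)
print(h.shape)
```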
Representation Learning with Contrastive Predictive Coding
- Computer Science · ArXiv
- 2018
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
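The core of Contrastive Predictive Coding is an InfoNCE loss in which a context vector must identify the true future latent among negatives. The sketch below shows that loss with a bilinear score; dimensions and the number of negatives are illustrative stand-ins rather than the paper's setup.

```python
# Sketch of the InfoNCE objective used in Contrastive Predictive Coding: the context
# vector must identify the true future latent among negative samples. Sizes are
# illustrative stand-ins.
import torch

batch, dim, n_neg = 16, 128, 10
context = torch.randn(batch, dim, requires_grad=True)        # c_t from an autoregressive model
z_future = torch.randn(batch, dim)                           # true latent k steps ahead
z_negatives = torch.randn(batch, n_neg, dim)                 # distractor latents

W = torch.randn(dim, dim, requires_grad=True)                # bilinear prediction matrix
pred = context @ W                                           # (batch, dim)

pos_logit = (pred * z_future).sum(dim=1, keepdim=True)                 # (batch, 1)
neg_logits = torch.bmm(z_negatives, pred.unsqueeze(2)).squeeze(2)      # (batch, n_neg)

logits = torch.cat([pos_logit, neg_logits], dim=1)           # positive is class 0
labels = torch.zeros(batch, dtype=torch.long)
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()
```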
Speaker Recognition from Raw Waveform with SincNet
- Computer Science · 2018 IEEE Spoken Language Technology Workshop (SLT)
- 2018
This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.
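The SincNet idea can be sketched in a few lines: each first-layer kernel is a band-pass filter built from two learnable cutoff frequencies (the difference of two windowed sinc low-pass responses), so only two parameters per filter are learned. The layer below is a simplified illustration, not the reference implementation.

```python
# Sketch of a SincNet-style first layer: each kernel is a band-pass filter defined by
# two learnable cutoffs, obtained as the difference of two windowed sinc low-pass
# responses. Window choice and parameter handling are simplified.
import torch
import torch.nn as nn

class SincConv(nn.Module):
    def __init__(self, n_filters=8, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.kernel_size, self.sr = kernel_size, sample_rate
        # Learnable low cutoff and bandwidth (in Hz) for each filter
        self.low_hz = nn.Parameter(torch.linspace(50, 4000, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 500.0))
        self.register_buffer("t", torch.arange(kernel_size) - (kernel_size - 1) / 2)
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):                               # x: (batch, 1, samples)
        low = torch.abs(self.low_hz) / self.sr          # normalized cutoffs
        high = low + torch.abs(self.band_hz) / self.sr
        t = self.t.unsqueeze(0)                         # (1, kernel_size)
        # Band-pass = difference of two low-pass sinc responses, one per filter
        lp_low = 2 * low.unsqueeze(1) * torch.sinc(2 * low.unsqueeze(1) * t)
        lp_high = 2 * high.unsqueeze(1) * torch.sinc(2 * high.unsqueeze(1) * t)
        filters = (lp_high - lp_low) * self.window      # (n_filters, kernel_size)
        return nn.functional.conv1d(x, filters.unsqueeze(1), padding=self.kernel_size // 2)

layer = SincConv()
out = layer(torch.randn(4, 1, 16000))
print(out.shape)                                        # (4, 8, 16000)
```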
SEGAN: Speech Enhancement Generative Adversarial Network
- Computer Science · INTERSPEECH
- 2017
This work proposes the use of generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.