Corpus ID: 236469084

Continual-wav2vec2: an Application of Continual Learning for Self-Supervised Automatic Speech Recognition

  • Samuel Kessler, Bethan Thomas, Salah Karout
We present a method for continual learning of speech representations for multiple languages using self-supervised learning (SSL) and apply these representations to automatic speech recognition. There is an abundance of unannotated speech, so creating self-supervised representations from raw audio and finetuning on small annotated datasets is a promising direction for building speech recognition systems. wav2vec models perform SSL on raw audio in a pretraining phase and then finetune on a small fraction of…


Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis
  • Mu Yang, Shaojin Ding, Tianlong Chen, Tong Wang, Zhangyang Wang
  • Computer Science, Engineering
  • ArXiv
  • 2021
This work formulates the replay process as a supervised learning problem and proposes a simple yet effective dual-sampler framework to tackle heavily language-imbalanced training samples. It shows that this supervised learning formulation outperforms other gradient-based and regularization-based lifelong learning methods.
Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0
A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves performance similar to that of the topline multilingual XLSR model, which is trained on fifty-three languages, on the target language ASR task.


Effectiveness of Self-Supervised Pre-Training for ASR
This work directly fine-tunes the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model, demonstrating that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.
wav2vec: Unsupervised Pre-training for Speech Recognition
Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training. The approach outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
Common Voice: A Massively-Multilingual Speech Corpus
This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages. For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition. The algorithm uses a Gumbel-softmax or online k-means clustering to quantize the dense representations.
Continual Learning in Automatic Speech Recognition
This work emulates continual learning observed in real life, where new training data are used for gradual improvement of an Automatic Speech Recognizer trained on old domains; the approach appears to yield a slight advantage over offline multi-condition training.
Unsupervised Pretraining Transfers Well Across Languages
It is shown that a slight modification of the CPC pretraining extracts features that transfer well to other languages, being on par with or even outperforming supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.
Self-Training and Pre-Training are Complementary for Speech Recognition
  • Qiantong Xu, Alexei Baevski, +5 authors Michael Auli
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
Pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups to improve speech recognition systems using unlabeled data.
Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition
The A2 framework overcomes the long-tail problem via three techniques: exploiting a pretrained multilingual language model (mBERT) to improve the performance of low-resource languages; proposing dual adapters consisting of both language-specific and language-agnostic adaptation with minimal additional parameters; and overcoming the class imbalance.
Transformers with convolutional context for ASR
This paper proposes replacing the sinusoidal positional embedding for transformers with convolutionally learned input representations that provide subsequent transformer blocks with relative positional information needed for discovering long-range relationships between local concepts.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.