Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

@inproceedings{Hsu2021RobustW2,
  title={Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training},
  author={Wei-Ning Hsu and Anuroop Sriram and Alexei Baevski and Tatiana Likhomanenko and Qiantong Xu and Vineel Pratap and Jacob Kahn and Ann Lee and Ronan Collobert and Gabriel Synnaeve and Michael Auli},
  booktitle={Interspeech},
  year={2021}
}
Self-supervised learning of speech representations has been a very active research area, but most work focuses on a single domain, such as read audiobooks, for which large quantities of labeled and unlabeled data exist. In this paper, we explore more general setups where the domain of the unlabeled data used for pre-training differs from the domain of the labeled data used for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data…
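
The setup studied here is the standard pre-train/fine-tune pipeline with a domain mismatch between the two stages. As a minimal sketch of that pipeline (not the paper's fairseq recipe), the snippet below fine-tunes a pre-trained wav2vec 2.0 checkpoint with a CTC loss on one labeled target-domain utterance, using the Hugging Face transformers library; the checkpoint name is a real public one, but the audio tensor and transcript are placeholders.

```python
# Hedged sketch: continued CTC fine-tuning of a pre-trained wav2vec 2.0 model on
# labeled audio from a new domain. Assumes `transformers` and the public
# "facebook/wav2vec2-base-960h" checkpoint; audio/transcript are dummy data.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # keep the convolutional encoder fixed, a common choice

audio = torch.randn(16000 * 3).numpy()  # placeholder for 3 s of 16 kHz target-domain audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor(text="HELLO FROM ANOTHER DOMAIN", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(inputs.input_values, labels=labels).loss  # CTC loss against the transcript
loss.backward()
optimizer.step()
```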

Citations

Boosting Cross-Domain Speech Recognition with Self-Supervision
TLDR: This work presents a systematic unsupervised domain adaptation (UDA) framework to fully utilize unlabeled data with self-supervision in the pre-training and fine-tuning paradigm, and introduces a two-step pseudo-labeling (PL) approach to incorporate target-domain linguistic knowledge, thus generating more accurate target-domain pseudo-labels; a sketch follows below.
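
As a rough illustration of the two-step PL idea (a sketch under stated assumptions, not the paper's implementation), the round below first decodes unlabeled target-domain audio with a target-domain language model fused in, then fine-tunes on the resulting pairs; `asr_decode` and `fine_tune` are hypothetical placeholders.

```python
# Schematic pseudo-labeling (PL) round; all callables are placeholders.
def pseudo_label_round(model, target_lm, unlabeled_audio, asr_decode, fine_tune):
    # Step 1: transcribe unlabeled target-domain audio with LM fusion so the
    # pseudo-labels carry target-domain linguistic knowledge.
    pairs = [(x, asr_decode(model, x, lm=target_lm)) for x in unlabeled_audio]
    # Step 2: fine-tune the acoustic model on the pseudo-labeled pairs.
    return fine_tune(model, pairs)
```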
Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition
TLDR: The effect of domain, language, dataset size, and other aspects of upstream SSL pre-training data on the low-resource downstream ASR task is studied, and the continued pre-training paradigm is built on to study the effect of prior knowledge possessed by models trained using SSL.
PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations
TLDR: PADA is proposed: redundant weights from models pre-trained on large amounts of out-of-domain (OOD) data are zeroed out to make space for target-domain ASR fine-tuning.
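
The core mechanic, zeroing out low-importance weights of an OOD-pre-trained model before target-domain fine-tuning, can be sketched with PyTorch's built-in magnitude pruning; PADA's actual pruning criteria are more involved, so this is only an illustration.

```python
# Illustrative magnitude pruning, not PADA's exact criterion: zero out the
# smallest-magnitude weights of an OOD-pre-trained layer to "make space"
# before target-domain ASR fine-tuning.
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 768)  # stand-in for one sub-layer of a pre-trained model

prune.l1_unstructured(layer, name="weight", amount=0.5)  # mask the 50% smallest weights
prune.remove(layer, "weight")  # make the zeros permanent in the weight tensor
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity after pruning: {sparsity:.2f}")
# Target-domain fine-tuning would then proceed on this sparsified model.
```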
Deploying self-supervised learning in the wild for hybrid automatic speech recognition
TLDR: The experimental results show that SSL pre-training with in-domain uncurated data can achieve better performance than all the alternative out-of-domain pre-training strategies.
On the Use of External Data for Spoken Named Entity Recognition
TLDR: This work considers self-training, knowledge distillation, and transfer learning for end-to-end (E2E) and pipeline (speech recognition followed by text NER) approaches, and finds that several of these approaches improve performance in resource-constrained settings beyond the benefits from pre-trained representations.
Self-Supervised Speech Representation Learning: A Review
TLDR: This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend their application beyond speech recognition.
How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications
TLDR: This work analyzes the robustness of Wav2Vec2.0 and XLS-R models on downstream ASR for a completely unseen domain, i.e., air traffic control (ATC) communications, and benchmarks the proposed models on four challenging ATC test sets.
Layer-Wise Analysis of a Self-Supervised Speech Representation Model
TLDR: This work examines one recent and successful pre-trained model (wav2vec 2.0) via its intermediate representation vectors, using a suite of analysis tools to characterize the evolution of information across model layers and to understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations.
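
Extracting the per-layer representations that such an analysis starts from is straightforward; the sketch below does so with the Hugging Face "facebook/wav2vec2-base" checkpoint (an assumption, the paper's own tooling differs), leaving the actual analysis tools aside.

```python
# Hedged sketch: collect intermediate representations from every transformer layer.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

audio = torch.randn(1, 16000)  # placeholder for 1 s of 16 kHz audio
with torch.no_grad():
    out = model(audio, output_hidden_states=True)

# out.hidden_states holds the initial embeddings plus one tensor per layer,
# each of shape (batch, frames, hidden_dim); these feed the layer-wise analysis.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i}: {tuple(h.shape)}")
```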
Unsupervised Data Selection via Discrete Speech Representation for ASR
TLDR: A simple and effective unsupervised data selection method is proposed that selects speech acoustically similar to a target domain; it takes the discrete speech representations available in common self-supervised learning frameworks as input and applies a contrastive data selection method to the discrete tokens.
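
A toy version of contrastive selection over discrete tokens (in the Moore-Lewis spirit; the paper's method uses proper language models over an SSL model's quantizer outputs, and all names below are illustrative) scores each candidate by how much more likely its tokens are under a target-domain model than under a general one:

```python
# Toy contrastive data selection over discrete speech tokens; all data is made up.
from collections import Counter
import math

def unigram_logprob(seq, counts, total, vocab_size):
    # add-one smoothed unigram log-probability of a token sequence
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in seq)

target_corpus = [[3, 3, 7, 1], [3, 7, 7, 2]]   # discrete tokens from target-domain speech
general_pool = [[5, 0, 9, 4], [8, 8, 1, 5]]    # discrete tokens from the unlabeled pool
VOCAB = 10                                     # size of the discrete codebook

tc = Counter(t for s in target_corpus for t in s)
gc = Counter(t for s in general_pool for t in s)
tn, gn = sum(tc.values()), sum(gc.values())

def selection_score(seq):
    # higher = tokens look more target-domain-like (Moore-Lewis style contrast)
    return (unigram_logprob(seq, tc, tn, VOCAB)
            - unigram_logprob(seq, gc, gn, VOCAB)) / len(seq)

candidates = [[3, 7, 3, 1], [8, 5, 0, 9]]
ranked = sorted(candidates, key=selection_score, reverse=True)
print(ranked[0])  # -> [3, 7, 3, 1], the more target-like utterance
```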
Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch
TLDR: It is found that the most important factors for positive transfer to downstream speech recognition tasks include a substantial amount of data and a matching pre-training domain.

References

Showing 1-10 of 62 references.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training
TLDR: DUST is proposed: a dropout-based, uncertainty-driven self-training technique that uses the agreement between multiple predictions of an ASR system, obtained under different dropout settings, to measure the model's uncertainty about its prediction.
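
The agreement test at the heart of DUST can be sketched as follows; `model` and `decode` are placeholders and the threshold is arbitrary, so this shows only the shape of the idea, not the paper's implementation.

```python
# Hedged sketch of dropout-based agreement filtering for self-training.
import torch

def edit_distance(a, b):
    # Levenshtein distance via a single-row dynamic program
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

@torch.no_grad()
def is_confident(model, decode, audio, n_samples=3, threshold=0.1):
    """Keep an utterance for self-training only if dropout-perturbed decodes agree."""
    model.eval()
    ref = decode(model, audio)       # deterministic reference hypothesis
    model.train()                    # re-enable dropout for stochastic decoding passes
    agree = True
    for _ in range(n_samples):
        hyp = decode(model, audio)   # hypothesis under a random dropout mask
        if edit_distance(ref, hyp) / max(len(ref), 1) > threshold:
            agree = False            # high disagreement -> model is uncertain
            break
    model.eval()
    return agree
```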
Effectiveness of self-supervised pre-training for speech recognition
TLDR: This work directly fine-tunes pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss, instead of feeding the representations into a task-specific model, demonstrating that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.
wav2vec: Unsupervised Pre-training for Speech Recognition
TLDR: wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
TLDR: Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity and phonemes, and even higher-level features such as emotional cues.
TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation
TLDR: This paper presents the TED-LIUM release 3 corpus, which multiplies the data available to train acoustic models in English by a factor of more than two, and presents recent development of Automatic Speech Recognition (ASR) systems in comparison with the two previous releases.
Libri-Light: A Benchmark for ASR with Limited or No Supervision
TLDR: A new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision, derived from open-source audio books from the LibriVox project; it is, to the authors' knowledge, the largest freely available corpus of speech.
A Study of Enhancement, Augmentation, and Autoencoder Methods for Domain Adaptation in Distant Speech Recognition
TLDR: The purpose of this paper is to quantify and characterize the performance gap between the two domains, setting up the basis for studying the adaptation of speech recognizers from close-talking speech to distant speech.
Rethinking Evaluation in ASR: Are Our Models Robust Enough?
TLDR: It is demonstrated that, when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world data.