Comparing CTC and LFMMI for out-of-domain adaptation of wav2vec 2.0 acoustic model

Apoorv Vyas, Srikanth R. Madikeri, Hervé Bourlard
In this work, we investigate whether wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues with connectionist temporal classification (CTC) training and thereby reduces its performance gap with flat-start lattice-free MMI (E2E-LFMMI) for automatic speech recognition with limited training data. Towards that objective, we use the pretrained wav2vec 2.0 BASE model and fine-tune it on three different datasets, including out-of-domain (Switchboard) and cross-lingual (Babel) scenarios. Our…

Improving Low-Resource Speech Recognition with Pretrained Speech Models: Continued Pretraining vs. Semi-Supervised Training

This paper investigates continued pretraining (CoPT) with unlabeled in-language audio data on the XLSR-53 pretrained model in several low-resource languages and shows that CoPT yields word error rates (WERs) equal to or slightly better than those obtained with semi-supervised training (SST).

How Does Pre-Trained Wav2Vec 2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

This work analyzes the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain, air traffic control (ATC) communications, and examines WERs in the low-resource scenario as well as the gender bias carried by one ATC dataset.

Boosting Cross-Domain Speech Recognition with Self-Supervision

This work presents a systematic unsupervised domain adaptation (UDA) framework that fully utilizes unlabeled data with self-supervision in the pre-training and fine-tuning paradigm, effectively boosting cross-domain performance and outperforming previous approaches.

Towards Better Domain Adaptation for Self-Supervised Models: A Case Study of Child ASR

This paper proposes a novel framework, domain responsible adaptation and finetuning (DRAFT), to reduce domain shift in pretrained speech models, and evaluates it for both causal and non-causal transformers.

A Comparison of Hybrid and End-to-End ASR Systems for the IberSpeech-RTVE 2020 Speech-to-Text Transcription Challenge

A comparison between hybrid and end-to-end Automatic Speech Recognition (ASR) systems, which were evaluated on the IberSpeech-RTVE 2020 Speech-to-Text Transcription Challenge, finds that when including DAT techniques, a relative WER improvement of 2.87% was obtained as compared to the PyChain-based system.

Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

The results indicate that using richer forms of Automatic Speech Recognition outputs allows SLU systems to improve in comparison to the 1-best setup, and that crossmodal architectures represent a good alternative for overcoming the limitations of working purely with automatically generated textual data.

Are disentangled representations all you need to build speaker anonymization systems?

Evaluation done using the VoicePrivacy 2022 toolkit showed that vector quantization helps conceal the original speaker identity while maintaining utility for speech recognition.



Improving LSTM-CTC based ASR performance in domains with limited training data

The results show that with an effective combination of data augmentation and regularization, an LSTM-CTC based system can exceed the performance of a strong Kaldi-based baseline trained on the same data.

Lattice-Free MMI Adaptation of Self-Supervised Pretrained Acoustic Models

The results show that fine-tuning with LFMMI consistently obtains relative WER improvements of 10% and 35.3% on the clean and other test sets of Librispeech (100h), 10.8% on Switchboard (300h), and 4.4% on Tagalog (84h) compared to the baseline trained only with supervised data.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being

Flat-Start Single-Stage Discriminatively Trained HMM-Based Models for ASR

This study investigates flat-start one-stage training of neural networks using the lattice-free maximum mutual information (LF-MMI) objective function with HMMs for large vocabulary continuous speech recognition and proposes a standalone system, which achieves word error rates comparable with those of the state-of-the-art multi-stage systems while being much faster to prepare.

wav2vec: Unsupervised Pre-training for Speech Recognition

Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training. The approach outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.

Evaluation of Feature-Space Speaker Adaptation for End-to-End Acoustic Models

Experimental results on the TED-LIUM corpus demonstrate that speaker adaptation, applied in combination with data augmentation techniques, provides, in an unsupervised adaptation mode, up to 11-20% relative word error rate reduction over the baseline model built on the raw filter-bank features.

Forget a Bit to Learn Better: Soft Forgetting for CTC-Based Automatic Speech Recognition

The experiments on the 300-hour English Switchboard dataset show that soft forgetting improves the word error rate (WER), including when the model is used with limited temporal context for streaming recognition. Some empirical insights into the regularization and data augmentation effects of soft forgetting are also presented.

Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED

Results from comparable systems over the five Option Period 1 languages indicate a strong correlation between ASR performance and KWS performance, and the approaches described show consistent trends over the languages investigated to date.

TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

A self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration, is introduced, and it is shown that the proposed method is transferable to downstream datasets not used in pre-training.

Unsupervised Pre-Training of Bidirectional Speech Encoders via Masked Reconstruction

It is found that the main factors that lead to speech recognition improvements are: masking segments of sufficient width in both time and frequency, pre-training on a much larger amount of unlabeled data than the labeled data, and domain adaptation when the unlabeled and labeled data come from different domains.