Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

@inproceedings{Ao2022PreTrainingTD,
  title={Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data},
  author={Junyi Ao and Zi-Hua Zhang and Long Zhou and Shujie Liu and Haizhou Li and Tom Ko and Lirong Dai and Jinyu Li and Yao Qian and Furu Wei},
  booktitle={Interspeech},
  year={2022}
}
This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language modeling at the encoder output, as in the HuBERT model, while the other lets the decoder learn to reconstruct…
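
As a rough sketch (not the authors' implementation), the two pre-training tasks described above could be expressed as the following PyTorch-style losses; the encoder, decoder, and tensor shapes are assumptions for illustration only.

import torch.nn.functional as F

def speech2c_style_losses(encoder, decoder, speech, codes, mask):
    # speech: (B, T, F) features; codes: (B, T) pseudo codes from the offline
    # clustering model; mask: (B, T) boolean mask of the masked input frames.
    enc_out, code_logits = encoder(speech, mask=mask)   # assumed encoder outputs
    # Task 1: HuBERT-like masked prediction of pseudo codes at the encoder output.
    loss_masked = F.cross_entropy(code_logits[mask], codes[mask])
    # Task 2: the decoder reconstructs the pseudo-code sequence autoregressively
    # (teacher forcing on the shifted code sequence).
    dec_logits = decoder(enc_out, codes[:, :-1])        # (B, T-1, vocab)
    loss_decoder = F.cross_entropy(dec_logits.transpose(1, 2), codes[:, 1:])
    return loss_masked + loss_decoder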

Citations

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

A unified-modal speech-unit-text pre-training model that connects the representations of a speech encoder and a text decoder with a shared unit encoder, achieving state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks.

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

A novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data, and introduces the phoneme modality into pre-training to help capture modality-invariant information between Mandarin speech and text.

Channel-Aware Pretraining of Joint Encoder-Decoder Self-Supervised Model for Telephonic-Speech ASR

A novel technique to obtain better downstream ASR performance from a joint encoder-decoder self-supervised model when trained with speech pooled from two different channels (narrow and wide band) and proposes non-overlapping cluster IDs for speech from different channels.
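
As a toy illustration of the non-overlapping cluster IDs mentioned above (the codebook size and function name are hypothetical), pseudo labels from one channel can simply be offset into a disjoint label space:

NUM_CLUSTERS = 500  # assumed codebook size per channel

def channel_aware_ids(cluster_ids, is_narrowband):
    # cluster_ids: list of ints in [0, NUM_CLUSTERS); narrow-band labels are shifted
    # so they never collide with wide-band labels during joint pre-training.
    offset = NUM_CLUSTERS if is_narrowband else 0
    return [c + offset for c in cluster_ids]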

The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task

Experimental results show that the YiTrans system obtains a significant improvement over the strong baseline on three translation directions, and it achieves +5.2 BLEU improvements over last year's optimal end-to-end system on tst2021 English-German.

Pre-training for Speech Translation: CTC Meets Optimal Transport

This work shows that the connectionist temporal classification (CTC) loss can reduce the modality gap by design, and proposes a novel pre-training method combining CTC and optimal transport to further reduce this gap.
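
A minimal sketch of how a CTC loss might be combined with an entropic optimal-transport term between speech and text encoder states; the Sinkhorn settings and the interpolation weight are illustrative assumptions, not the paper's exact formulation.

import math
import torch
import torch.nn.functional as F

def sinkhorn_ot(cost, n_iters=20, eps=0.1):
    # cost: (m, n) pairwise cost matrix; returns an entropic-OT transport cost
    # under uniform marginals, via log-domain Sinkhorn iterations.
    m, n = cost.shape
    log_a = torch.full((m,), -math.log(m))
    log_b = torch.full((n,), -math.log(n))
    f, g = torch.zeros(m), torch.zeros(n)
    for _ in range(n_iters):
        f = eps * (log_a - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps)
    return (plan * cost).sum()

def ctc_plus_ot_loss(log_probs, targets, input_lens, target_lens,
                     speech_states, text_states, lam=0.1):
    # Standard CTC loss on the speech encoder outputs (log_probs: (T, B, V)) ...
    loss_ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    # ... plus an OT distance between one utterance's speech and text states.
    cost = torch.cdist(speech_states, text_states)   # (T_speech, T_text) L2 costs
    return loss_ctc + lam * sinkhorn_ot(cost)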

CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

The Code BERT (CoBERT) approach for self-supervised speech representation learning outperforms the most recent state of the art on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task.

Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings

Two approaches are proposed: WavEmbed, a multimodal sequential autoencoder that predicts hidden units from a dense representation of speech, and S-HuBERT, which induces meaning through knowledge distillation, in which a sentence embedding model is first trained on hidden units and then passes its knowledge to a speech encoder through contrastive learning.
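
In the spirit of the distillation step described above, a contrastive (InfoNCE-style) loss that pulls a speech encoder's sentence embedding toward a frozen teacher's embedding might look like the following sketch; the names and the temperature are placeholders.

import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_emb, teacher_emb, temperature=0.07):
    # student_emb: (B, D) embeddings from the speech encoder being trained;
    # teacher_emb: (B, D) embeddings from a frozen sentence-embedding teacher.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.t() / temperature          # (B, B) cosine-similarity matrix
    labels = torch.arange(s.size(0))          # diagonal pairs are the positives
    return F.cross_entropy(logits, labels)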

Speech Corpora Divergence Based Unsupervised Data Selection for ASR

An unsupervised target-aware data selection method based on speech corpora divergence (SCD), which can measure the similarity between two speech corpora, focus on more acoustic details, and guarantee the diversity of the selected set.
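
One simple way to quantify divergence between two speech corpora, shown purely for illustration and not necessarily the paper's SCD definition, is a smoothed KL divergence between their discrete-unit (pseudo-code) distributions:

from collections import Counter
import math

def corpus_divergence(units_a, units_b, vocab_size, smooth=1e-6):
    # units_a, units_b: flat lists of cluster IDs extracted from each corpus.
    def dist(units):
        counts = Counter(units)
        total = len(units) + smooth * vocab_size
        return [(counts.get(i, 0) + smooth) / total for i in range(vocab_size)]
    p, q = dist(units_a), dist(units_b)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))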

Improving Automatic Speech Recognition for Low-Resource Language by Data Augmentation

This study focuses on data augmentation approaches to deal with small-size datasets and help the deep learning network converge better on the ASR task.

References

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Text Data

This paper presents a method to pre-train transformer-based encoder-decoder automatic speech recognition (ASR) models using sufficient target-domain text. During pre-training, we train the…

Unsupervised pre-training for sequence to sequence speech recognition

A novel approach to pre-train the encoder and decoder of a sequence-to-sequence (seq2seq) model with unpaired speech and transcripts, respectively, which benefits downstream automatic speech recognition (ASR) tasks.

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

A new pre-trained model, WavLM, is proposed, to solve full-stack downstream speech tasks and achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.

Effectiveness of self-supervised pre-training for speech recognition

This work directly fine-tunes the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model, demonstrating that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.
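
A minimal sketch of this fine-tuning recipe, assuming a generic pre-trained speech encoder and a character vocabulary; the class and argument names are hypothetical.

import torch.nn as nn
import torch.nn.functional as F

class CTCFineTuner(nn.Module):
    def __init__(self, pretrained_encoder, hidden_dim, vocab_size):
        super().__init__()
        self.encoder = pretrained_encoder              # self-supervised model, reused as-is
        self.head = nn.Linear(hidden_dim, vocab_size)  # small task layer on top

    def forward(self, speech, targets, input_lens, target_lens):
        hidden = self.encoder(speech)                  # (B, T, hidden_dim)
        log_probs = self.head(hidden).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        return F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)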

Direct Speech-to-Speech Translation With Discrete Units

A direct speech-to-speech translation model that translates speech from one language to speech in another language without relying on intermediate text generation is presented and is comparable to models that predict spectrograms and are trained with text supervision.

Unsupervised Pre-Training of Bidirectional Speech Encoders via Masked Reconstruction

It is found that the main factors that lead to speech recognition improvements are: masking segments of sufficient width in both time and frequency, pre-training on a much larger amount of unlabeled data than the labeled data, and domain adaptation when the unlabeled and labeled data come from different domains.
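
A sketch of masked reconstruction pre-training along the lines described above; the mask widths and the L1 objective are illustrative choices rather than the paper's exact setup.

import torch
import torch.nn.functional as F

def masked_reconstruction_loss(model, feats, t_width=40, f_width=20):
    # feats: (B, T, F) log-mel features; the model reconstructs the masked regions.
    B, T, Fdim = feats.shape
    masked = feats.clone()
    mask = torch.zeros(B, T, Fdim, dtype=torch.bool)
    for b in range(B):
        t0 = torch.randint(0, max(1, T - t_width), (1,)).item()
        f0 = torch.randint(0, max(1, Fdim - f_width), (1,)).item()
        mask[b, t0:t0 + t_width, :] = True        # wide mask along time
        mask[b, :, f0:f0 + f_width] = True        # wide mask along frequency
    masked[mask] = 0.0
    recon = model(masked)                          # (B, T, F) reconstruction
    return F.l1_loss(recon[mask], feats[mask])     # loss only on masked positions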

Generative Pre-Training for Speech with Autoregressive Predictive Coding

Yu-An Chung, James R. Glass. ICASSP 2020.
This paper proposes to use autoregressive predictive coding (APC), a recently proposed self-supervised objective, as a generative pre-training approach for learning meaningful, non-specific, and transferable speech representations.
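
The APC objective itself is compact: a causal model predicts the frame n steps ahead of its current context. A minimal sketch, where the shift n and the L1 loss follow common APC setups and the model is a placeholder:

import torch.nn.functional as F

def apc_loss(model, feats, n=3):
    # feats: (B, T, F); model is a causal (e.g. unidirectional RNN) predictor.
    pred = model(feats[:, :-n, :])        # predictions from frames 0..T-n-1
    target = feats[:, n:, :]              # frames shifted n steps into the future
    return F.l1_loss(pred, target)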

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
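
A toy illustration of the denoising idea: corrupt the input token sequence (here with span infilling, one of BART's noising schemes) and train a sequence-to-sequence model to reconstruct the original. The mask token id, corruption rates, and the seq2seq interface are arbitrary assumptions.

import random
import torch
import torch.nn.functional as F

MASK_ID = 4  # assumed id of a special <mask> token

def corrupt(tokens, span=3, p=0.15):
    # Replace random spans of the input with a single mask token.
    out, i = [], 0
    while i < len(tokens):
        if random.random() < p:
            out.append(MASK_ID)
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

def denoising_loss(seq2seq, tokens):
    src = torch.tensor([corrupt(tokens)])
    tgt = torch.tensor([tokens])
    logits = seq2seq(src, tgt[:, :-1])             # teacher forcing on the clean text
    return F.cross_entropy(logits.transpose(1, 2), tgt[:, 1:])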

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and, exploiting the advantages of both multi-objective learning and joint decoding, exhibits performance comparable to conventional DNN/HMM ASR systems without using linguistic resources.
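
The hybrid training objective is a weighted sum of the CTC and attention (cross-entropy) losses; a minimal sketch, with an illustrative interpolation weight:

import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, att_logits, targets,
                              input_lens, target_lens, lam=0.3):
    # ctc_log_probs: (T, B, V) from the CTC branch; att_logits: (B, S, V) from the
    # attention decoder; targets: (B, S) padded label sequences shared by both losses.
    loss_ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens, blank=0)
    loss_att = F.cross_entropy(att_logits.transpose(1, 2), targets)
    return lam * loss_ctc + (1 - lam) * loss_att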