Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models

  • Keqi Deng, Songjun Cao, Yike Zhang, Long Ma
  • Published 13 December 2021
  • Computer Science
  • 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model still struggles to fully utilize self-supervised pretraining methods because its decoder is conditioned on the acoustic representation and thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize…
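As background for the hybrid CTC/attention architecture the abstract builds on: such models are trained with an interpolated objective combining a CTC loss and an attention-decoder loss. A minimal sketch of that interpolation (the function name and the weight `lam` are illustrative, not taken from the paper):

```python
def hybrid_ctc_attention_loss(ctc_loss: float, attention_loss: float,
                              lam: float = 0.3) -> float:
    """Interpolate CTC and attention losses, as in hybrid CTC/attention
    multi-task training: L = lam * L_ctc + (1 - lam) * L_att."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# Example: equal component losses are unchanged by interpolation.
print(hybrid_ctc_attention_loss(2.0, 2.0, lam=0.5))  # → 2.0
```

In practice the two component losses come from the same shared encoder, so the CTC branch regularizes the encoder toward monotonic alignments while the attention branch handles output dependencies.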


Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models
The proposed NAR model significantly surpasses previous NAR systems on the AISHELL-1 benchmark, shows potential on English tasks, and includes a novel modality conversion mechanism that is more suitable for logographic languages.
Improving CTC-based speech recognition via knowledge transferring from pre-trained language models
This work proposes two knowledge transferring methods that leverage pre-trained LMs, such as BERT and GPT2, to improve CTC-based models, and proposes a joint classification learning method that combines GPT2 for text modeling with a hybrid CTC/attention architecture.
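One generic way to transfer knowledge from a pre-trained LM is to distill its soft label distribution into the ASR model via a KL term. A minimal sketch of that distillation loss, offered as illustration of the general technique rather than the specific method of the paper above:

```python
import math

def kl_distillation_loss(teacher_probs, student_probs):
    """KL(teacher || student) over one output distribution: a common
    objective for transferring a pre-trained LM's soft labels."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

# Identical distributions give zero loss.
print(kl_distillation_loss([0.5, 0.5], [0.5, 0.5]))  # → 0.0
```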
A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition
This work proposes a complementary joint training (CJT) method that trains a model alternatively with two data pairs, and label masking for pseudo-labels and gradient restriction for synthesized audio are proposed to further cope with the deviations from real data.
Improving Deliberation by Text-Only and Semi-Supervised Training
This work proposes incorporating text-only and semi-supervised training into an attention-based deliberation model, and shows the proposed deliberation rescorer outperforms a state-of-the-art LM rescoring method, and wins in a human side-by-side evaluation.
Enhancing Speech Recognition Decoding via Layer Aggregation
A prediction method is proposed that aggregates the top M layers, potentially leveraging useful information encoded in intermediate layers and relaxing model confidence and showcasing the effectiveness of the approach via beam search decoding.
Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies
A real-time encoder-state revision strategy that modifies previous states is introduced, and a CTC spike position alignment decoding algorithm is designed to reduce the time cost introduced by the revision strategy.
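Several entries above build on CTC decoding, which maps frame-level outputs to a label sequence by merging consecutive repeats and removing blanks. A minimal greedy-collapse sketch (the blank symbol and function name are illustrative):

```python
BLANK = "_"  # illustrative blank symbol

def ctc_greedy_collapse(frame_labels):
    """Collapse a frame-level CTC path into an output sequence:
    merge consecutive repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in frame_labels:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_greedy_collapse("__hh_e_ll__ll_oo_"))  # → hello
```

The "spikes" mentioned above are the frames where the non-blank symbols receive high probability; aligning to them lets a decoder skip the blank-dominated frames.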
Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition
A pre-trained acoustic encoder and a pre-trained linguistic encoder are fused into an end-to-end ASR model that achieves better recognition performance on the CALLHOME corpus, and a scheduled fine-tuning strategy is proposed to preserve and utilize the text-context modeling ability of the pre-trained linguistic encoder.
Joint CTC/attention decoding for end-to-end speech recognition
This paper proposes joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes both advantages in decoding.
Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks, and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
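The joint decoding described in the two entries above ranks beam-search hypotheses by an interpolated score combining the CTC and attention log-probabilities, often with an external LM term. A minimal sketch of that score combination (the weights `lam` and `beta` and the function name are illustrative):

```python
def joint_decoding_score(log_p_ctc: float, log_p_att: float,
                         log_p_lm: float = 0.0,
                         lam: float = 0.3, beta: float = 0.0) -> float:
    """Per-hypothesis score for joint CTC/attention decoding:
    lam * log P_ctc + (1 - lam) * log P_att + beta * log P_lm."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att + beta * log_p_lm

# Rank two hypotheses by the joint score (higher is better).
hyps = {"a": (-4.0, -2.0), "b": (-2.0, -4.0)}
best = max(hyps, key=lambda h: joint_decoding_score(*hyps[h], lam=0.3))
print(best)  # → a
```

With `lam` below 0.5 the attention score dominates, while the CTC term penalizes hypotheses whose alignments drift from monotonicity.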
Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems
  • Yinghui Huang, H. Kuo, M. Picheny
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system, and investigated two techniques to improve the S2I system, transfer learning and data augmentation, which recover 80% of the performance lost due to using limited intent-labeled speech.
Integrating Knowledge Into End-to-End Speech Recognition From External Text-Only Data
A unified method called LST (Learn Spelling from Teachers) to integrate knowledge into an AED model from the external text-only data and leverage the whole context in a sentence is proposed.
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Text Data
This paper presents a method to pre-train transformer-based encoder-decoder automatic speech recognition (ASR) models using sufficient target-domain text…
Joint Masked CPC And CTC Training For ASR
This paper demonstrates a single-stage training of ASR models that can utilize both unlabeled and labeled data and postulates that solving the contrastive task is a regularization for the supervised CTC loss.
Applying wav2vec2.0 to Speech Recognition in various low-resource languages
This work applies pre-trained models to solve low-resource speech recognition tasks in various spoken languages to verify its universality over languages and achieves more than 20% relative improvements in six languages compared with previous work.
Non-autoregressive Transformer-based End-to-end ASR using BERT
A non-autoregressive Transformer-based end-to-end ASR model based on BERT is proposed and a series of experiments on the AISHELL-1 dataset are conducted that demonstrate competitive or superior results for the model when compared to state-of-the-art ASR systems.
On Scaling Contrastive Representations for Low-Resource Speech Recognition
It is found that wav2vec 2.0 representations live in a low dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer.