Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models

@article{Deng2021ImprovingHC,
  title={Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models},
  author={Keqi Deng and Songjun Cao and Yike Zhang and Long Ma},
  journal={2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2021},
  pages={76-82}
}
  • Keqi Deng, Songjun Cao, Yike Zhang, Long Ma
  • Published 13 December 2021
  • Computer Science
  • 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model still struggles to fully utilize self-supervised pretraining methods, because its decoder is conditioned on acoustic representations and thus cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize…
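
For orientation, below is a minimal PyTorch-style sketch of the hybrid CTC/attention objective the paper builds on; the module names, tensor shapes, and the ctc_weight value are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridCTCAttention(nn.Module):
    # Sketch only: `encoder` stands in for a self-supervised pretrained
    # acoustic model, `decoder` for an attention decoder; ctc_weight is an
    # assumed interpolation weight, not a value from the paper.
    def __init__(self, encoder, decoder, vocab_size, d_model=768, ctc_weight=0.3):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.ctc_proj = nn.Linear(d_model, vocab_size)
        self.ctc_weight = ctc_weight

    def forward(self, speech, speech_lens, tokens, token_lens):
        h = self.encoder(speech)                              # (B, T, d_model)
        # CTC branch: frame-level predictions, independent of the decoder.
        log_probs = F.log_softmax(self.ctc_proj(h), dim=-1)
        ctc = F.ctc_loss(log_probs.transpose(0, 1), tokens,
                         speech_lens, token_lens, blank=0)
        # Attention branch: the decoder is conditioned on encoder states,
        # which is exactly why it cannot be pretrained in isolation.
        logits = self.decoder(tokens, h)                      # (B, U, vocab)
        att = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                              tokens[:, 1:].reshape(-1))
        return self.ctc_weight * ctc + (1 - self.ctc_weight) * att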

Citations

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models

TLDR
The proposed NAR model, built on a novel modality conversion mechanism that is more suitable for logographic languages, significantly surpasses previous NAR systems on the AISHELL-1 benchmark and shows potential for English tasks.

Improving CTC-Based Speech Recognition Via Knowledge Transferring from Pre-Trained Language Models

TLDR
This work proposes two knowledge transferring methods that leverage pre-trained LMs, such as BERT and GPT2, to improve CTC-based models, and proposes a joint classification learning method that combines GPT2 for text modeling with a hybrid CTC/attention architecture.

A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition

TLDR
This work proposes a complementary joint training (CJT) method that trains a model alternately on two kinds of data pairs (real speech with pseudo-labels and synthesized audio with text); label masking for pseudo-labels and gradient restriction for synthesized audio are proposed to further cope with deviations from real data.

Improving Deliberation by Text-Only and Semi-Supervised Training

TLDR
This work proposes incorporating text-only and semi-supervised training into an attention-based deliberation model, and shows that the proposed deliberation rescorer outperforms a state-of-the-art LM rescoring method and wins in a human side-by-side evaluation.

Enhancing Speech Recognition Decoding via Layer Aggregation

TLDR
A prediction method is proposed that aggregates the top M layers, leveraging useful information encoded in intermediate layers and relaxing model over-confidence; the effectiveness of the approach is shown via beam search decoding.
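
A minimal sketch of this aggregation idea, assuming per-layer hidden states are collected and share one output projection (both assumptions for illustration):

import torch

def aggregate_top_m_logits(layer_states, output_proj, m=4):
    # layer_states: list of (B, U, d) hidden states, one per decoder layer.
    # Average the logits of the last m layers instead of using only the top
    # layer; m=4 and the shared projection are illustrative choices.
    logits = [output_proj(h) for h in layer_states[-m:]]
    return torch.stack(logits, dim=0).mean(dim=0)            # (B, U, vocab)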

Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

TLDR
A real-time encoder-states revision strategy that modifies previous states is introduced, and a CTC spike-position alignment decoding algorithm is designed to reduce the time costs brought by the proposed revision strategy.

References

SHOWING 1-10 OF 45 REFERENCES

Joint CTC/attention decoding for end-to-end speech recognition

TLDR
This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both in decoding.
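
A hedged sketch of the joint scoring rule in its simplest rescoring form (the paper scores partial hypotheses during the search itself, using a CTC prefix score); lam is an assumed interpolation weight:

def joint_rescore(hyps, ctc_log_prob, att_log_prob, lam=0.3):
    # hyps: candidate token sequences from beam search;
    # ctc_log_prob / att_log_prob: callables returning a sequence-level
    # log-probability under each branch; lam is an assumed CTC weight.
    scored = [(lam * ctc_log_prob(y) + (1 - lam) * att_log_prob(y), y)
              for y in hyps]
    return max(scored, key=lambda s: s[0])[1]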

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

TLDR
The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and exhibits performance comparable to conventional DNN/HMM ASR systems, drawing on the advantages of both multiobjective learning and joint decoding without linguistic resources.
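
The multiobjective training loss of this architecture is commonly written as an interpolation of the two branch losses (λ is the standard tunable weight; the notation is assumed, not quoted from this page):

\mathcal{L}_{\mathrm{MTL}} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{Attention}}, \qquad 0 \le \lambda \le 1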

Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems

  • Yinghui Huang, H. Kuo, M. Picheny
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
This paper implemented a CTC-based S2I system that matches the performance of a state-of-the-art traditional cascaded SLU system, and investigated two techniques to improve the S2I system, transfer learning and data augmentation, which recover 80% of the performance lost due to using limited intent-labeled speech.

Integrating Knowledge Into End-to-End Speech Recognition From External Text-Only Data

TLDR
A unified method called LST (Learn Spelling from Teachers) is proposed to integrate knowledge from external text-only data into an AED model and to leverage the whole context in a sentence.

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Text Data

This paper presents a method to pre-train transformer-based encoder-decoder automatic speech recognition (ASR) models using sufficient target-domain text. During pre-training, we train the…

Joint Masked CPC And CTC Training For ASR

TLDR
This paper demonstrates a single-stage training of ASR models that can utilize both unlabeled and labeled data and postulates that solving the contrastive task is a regularization for the supervised CTC loss.
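
A minimal sketch of such single-stage joint training, assuming hypothetical contrastive_loss and ctc_loss helpers on the model:

def joint_step(batch, model, alpha=1.0):
    # Single-stage joint training sketch: every batch contributes a
    # contrastive (masked-CPC-style) loss; labeled batches additionally
    # contribute the supervised CTC loss. `alpha` and the two loss helpers
    # on `model` are assumptions for illustration.
    feats = model.encode(batch["speech"])          # shared encoder
    loss = model.contrastive_loss(feats, batch["mask"])
    if batch.get("tokens") is not None:            # labeled data available
        loss = loss + alpha * model.ctc_loss(feats, batch["tokens"])
    return loss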

Non-autoregressive Transformer-based End-to-end ASR using BERT

TLDR
A non-autoregressive Transformer-based end-to-end ASR model based on BERT is proposed and a series of experiments on the AISHELL-1 dataset are conducted that demonstrate competitive or superior results for the model when compared to state-of-the-art ASR systems.

On Scaling Contrastive Representations for Low-Resource Speech Recognition

TLDR
It is found that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer.
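
As one plausible illustration of decorrelation (PCA whitening is a standard choice; the paper's exact procedure may differ):

import numpy as np

def pca_whiten(feats, eps=1e-5):
    # feats: (N, d) stacked frame representations, e.g. frozen wav2vec 2.0
    # outputs. PCA whitening removes correlations between feature dimensions;
    # this is one standard way to decorrelate, not necessarily the paper's.
    x = feats - feats.mean(axis=0)
    cov = x.T @ x / len(x)
    w, v = np.linalg.eigh(cov)                # eigendecomposition of covariance
    return (x @ v) / np.sqrt(w + eps)         # rotated features, unit variance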

History Utterance Embedding Transformer LM for Speech Recognition

TLDR
This work proposes the history utterance embedding Transformer LM (HTLM), which includes an embedding generation network for extracting contextual information contained in history utterances and a main Transformer LM for current prediction; two-stage attention (TSA) is proposed to encode richer contextual information into the embeddings of history utterances while supporting GPU-parallel training.

Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

TLDR
This work leverages both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation; the approach outperforms other LM application methods such as n-best rescoring and shallow fusion, while requiring no extra inference cost.
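
A hedged sketch of the distillation objective this describes, assuming BERT soft labels are precomputed offline (which is why no extra inference cost is incurred); beta is an illustrative weight:

import torch.nn.functional as F

def distill_loss(student_logits, bert_probs, targets, beta=0.5):
    # student_logits: (N, V) per-token logits from the seq2seq decoder;
    # bert_probs: (N, V) soft labels precomputed with BERT;
    # beta is an assumed interpolation weight, not a value from the paper.
    ce = F.cross_entropy(student_logits, targets)             # hard-label loss
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  bert_probs, reduction="batchmean")          # soft-label loss
    return (1 - beta) * ce + beta * kd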