Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Shigeki Karita, Nelson Yalta, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, Tomohiro Nakatani
The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. To realize a faster and more accurate ASR system, we combine Transformer and the advances in RNN-based ASR. In our experiments, we found that the training of Transformer is slower than that of RNN as regards the learning curve and that integration with the naive language model (LM) is difficult.
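One common way to integrate CTC and an external LM at decoding time is to rank hypotheses by a weighted sum of log-probabilities. The sketch below illustrates that idea only; it is not taken from the paper, and the function names, the weight values, and the toy hypothesis scores are all hypothetical.

```python
# A minimal sketch of log-linear score combination for hypothesis ranking.
# Assumption: each hypothesis already carries three log-probabilities
# (attention decoder, CTC, language model). The weights are illustrative.

def combined_score(log_p_att, log_p_ctc, log_p_lm, ctc_weight=0.3, lm_weight=0.5):
    """Interpolate attention and CTC scores, then add a weighted LM score."""
    acoustic = (1.0 - ctc_weight) * log_p_att + ctc_weight * log_p_ctc
    return acoustic + lm_weight * log_p_lm


def pick_best(hypotheses, ctc_weight=0.3, lm_weight=0.5):
    """Return the text of the hypothesis with the highest combined score.

    `hypotheses` is a list of (text, log_p_att, log_p_ctc, log_p_lm) tuples.
    """
    best = max(
        hypotheses,
        key=lambda h: combined_score(h[1], h[2], h[3], ctc_weight, lm_weight),
    )
    return best[0]


# Toy example with made-up scores.
toy_hypotheses = [
    ("a", -1.0, -2.0, -3.0),
    ("b", -1.5, -1.0, -1.0),
]
best_text = pick_best(toy_hypotheses)
```

Here hypothesis "a" has the better attention score, but "b" wins once the CTC and LM scores are taken into account.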

Streaming Automatic Speech Recognition with the Transformer Model

This work proposes a Transformer-based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word, and applies time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
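Time-restricted self-attention can be pictured as a mask that lets each frame attend only to a bounded window of neighbouring frames. The following is a minimal sketch of such a mask; the function name and the `left`/`right` window parameters are assumptions for illustration, not the configuration used in the cited work.

```python
# A minimal sketch of a time-restricted attention mask.
# Assumption: frame t may attend to frames t-left .. t+right (inclusive).

def time_restricted_mask(num_frames, left, right):
    """Return a num_frames x num_frames boolean matrix.

    mask[t][s] is True when frame t is allowed to attend to frame s.
    """
    return [
        [(t - left) <= s <= (t + right) for s in range(num_frames)]
        for t in range(num_frames)
    ]


# Example: 5 frames, 2 frames of past context, 1 frame of look-ahead.
mask = time_restricted_mask(5, left=2, right=1)
```

Limiting `right` bounds how much future audio each encoder layer needs, which is what makes the restriction relevant to streaming.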

A Comparative Study on Transformer vs RNN in Speech Applications

An emergent sequence-to-sequence model called Transformer achieves state-of-the-art performance in neural machine translation and other natural language processing applications. This study reports the surprising superiority of Transformer over RNN in 13 of 15 ASR benchmarks.

A study of transformer-based end-to-end speech recognition system for Kazakh language

The joint use of Transformer and connectionist temporal classification models was found to improve the performance of the Kazakh speech recognition system, and with an integrated language model it achieved its best character error rate of 3.7% on a clean dataset.
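Character error rate is conventionally computed as the edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that standard definition; the function names are hypothetical, and it does not reproduce the scoring tool used in the cited study.

```python
# A minimal sketch of character error rate (CER) via Levenshtein distance.

def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a reference character
                    curr[j - 1] + 1,     # insert a hypothesis character
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For instance, a four-character reference with one substituted character gives a CER of 0.25, that is, 25%.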

Simplified Self-Attention for Transformer-Based end-to-end Speech Recognition

A simplified SSAN-based Transformer model is proposed for end-to-end speech recognition, which employs FSMN memory blocks instead of projection layers to form query and key vectors and shows no loss of recognition performance on the 20,000-hour large-scale Mandarin tasks.

Cross Attention with Monotonic Alignment for Speech Transformer

This paper presents an effective cross-attention biasing technique for the Transformer that takes the monotonic alignment between text output and speech input into consideration by making use of cross-attention weights, and introduces a regularizer for alignment regularization.

Attention-Based ASR with Lightweight and Dynamic Convolutions

This paper proposes applying lightweight and dynamic convolution to E2E ASR as an alternative architecture to self-attention, making the computational order linear, and further proposes joint training with connectionist temporal classification, convolution on the frequency axis, and combination with self-attention.

End-To-End Multi-Speaker Speech Recognition With Transformer

This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture, and incorporates an external dereverberation preprocessing, the weighted prediction error (WPE), enabling the model to handle reverberated signals.

Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

To improve the performance of an E2E agglutinative language speech recognition system, a new feature extractor, MSPC, is proposed, which uses convolution kernels of different sizes to extract and fuse features of different scales and is superior to VGGnet.

Hyperparameter experiments on end-to-end automatic speech recognition

This paper investigates the impact of hyperparameters in the Transformer network to answer two questions: which hyperparameter plays a critical role in the task performance and training speed, and which hyperparameters are altered in the encoder and decoder networks.

Systems for Low-Resource Speech Recognition Tasks in Open Automatic Speech Recognition and Formosa Speech Recognition Challenges

This work builds and compares end-to-end (E2E) systems and Deep Neural Network Hidden Markov Model (DNN-HMM) systems, and includes a speaker classifier with a gradient reversal layer in the training phase to improve robustness to speaker variation.



Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition

  • Linhao Dong, Shuang Xu, Bo Xu
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
The Speech-Transformer is presented, a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn the positional dependencies and can be trained faster with more efficiency, along with a 2D-Attention mechanism that jointly attends to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer.

End-to-end attention-based large vocabulary speech recognition

This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.

Joint CTC-attention based end-to-end speech recognition using multi-task learning

A novel method for end-to-end speech recognition is proposed that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
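In the multi-task setup, the training objective is typically a weighted interpolation of the CTC loss and the attention loss. The sketch below shows that interpolation on plain numbers; the function name and the default weight are assumptions for illustration, and in practice the two losses would come from a model's CTC and attention branches.

```python
# A minimal sketch of a joint CTC-attention training objective.
# Assumption: `ctc_loss` and `attention_loss` are already-computed scalar
# losses for the same batch; `ctc_weight` is an illustrative value in [0, 1].

def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate the two losses: w * L_ctc + (1 - w) * L_att."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Toy example with made-up loss values.
loss = joint_ctc_attention_loss(ctc_loss=10.0, attention_loss=20.0, ctc_weight=0.3)
```

Setting `ctc_weight` to 0 recovers pure attention training, and setting it to 1 recovers pure CTC training.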

Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese

Sequence-to-sequence attention-based models have recently shown very promising results on automatic speech recognition (ASR) tasks, which integrate an acoustic, pronunciation and language model into

Self-attention Networks for Connectionist Temporal Classification in Speech Recognition

This work proposes SAN-CTC, a deep, fully self-attentional network for CTC, and shows it is tractable and competitive for end-to-end speech recognition, and explores how label alphabets affect attention heads and performance.

End-to-end Speech Recognition With Word-Based Rnn Language Models

A novel word-based RNN-LM is proposed, which allows decoding with only the word-based LM, where it provides look-ahead word probabilities to predict next characters instead of the character-based LM, leading to competitive accuracy with less computation compared to the multi-level LM.

Improved training of end-to-end attention models for speech recognition

This work introduces a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.

Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

This work learns to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network and beats out traditional hybrid ASR systems on spontaneous Japanese and Chinese speech.

Very deep convolutional networks for end-to-end speech recognition

This work successively trains very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models, and applies network-in-network principles, batch normalization, residual connections and convolutional LSTMs to build very deep recurrent and convolutional structures.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.