Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

@inproceedings{Karita2019ImprovingTE,
  title={Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration},
  author={Shigeki Karita and Nelson Yalta and Shinji Watanabe and Marc Delcroix and Atsunori Ogawa and Tomohiro Nakatani},
  booktitle={INTERSPEECH},
  year={2019}
}
The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. [...] Key Method: To realize a faster and more accurate ASR system, we combine the Transformer and the advances in RNN-based ASR. In our experiments, we found that the training of the Transformer is slower than that of an RNN as regards the learning curve, and that integration with a naive language model (LM) is difficult.
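The joint CTC/attention training discussed throughout this listing is commonly implemented as a simple interpolation of the two losses. A minimal plain-Python sketch (the name `ctc_weight` and its default value are illustrative, not taken from the paper):

```python
def joint_loss(ctc_loss: float, attention_loss: float, ctc_weight: float = 0.3) -> float:
    """Multi-task loss for joint CTC/attention training.

    L = w * L_ctc + (1 - w) * L_att, with the interpolation weight w in [0, 1].
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss
```

In practice the two loss terms come from a CTC head and an attention decoder sharing one encoder; the interpolation weight is a tuning knob, with the CTC branch mainly serving to enforce monotonic alignment and speed up convergence.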
Cross Attention with Monotonic Alignment for Speech Transformer
TLDR
This paper presents an effective cross-attention biasing technique for the Transformer that takes the monotonic alignment between text output and speech input into consideration by making use of cross-attention weights, and introduces an alignment regularizer.
Attention-Based ASR with Lightweight and Dynamic Convolutions
TLDR
This paper proposes to apply lightweight and dynamic convolution to E2E ASR as an alternative architecture to self-attention, making the computational order linear, and proposes joint training with connectionist temporal classification, convolution on the frequency axis, and combination with self-attention.
End-To-End Multi-Speaker Speech Recognition With Transformer
TLDR
This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture, and incorporates an external dereverberation preprocessing, the weighted prediction error (WPE), enabling the model to handle reverberated signals.
Jointly Trained Transformers Models for Spoken Language Translation
TLDR
This work trains SLT systems with an ASR objective as an auxiliary loss, with both networks connected through the neural hidden representations; the final BLEU score is on par with the best speech translation system on the How2 dataset without using any additional training data or language model, and with fewer parameters.
Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition
TLDR
Experimental results show that the Mandarin Transformer transducer using syllables with tone achieves the best performance, and a new mixed-bandwidth training method is presented to obtain a general model able to accurately recognize Mandarin speech at different sampling rates simultaneously.
Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning
TLDR
Two unsupervised pre-training strategies are proposed for the encoder and the decoder of the Transformer, respectively, which make full use of unpaired data for training; in addition, a new semi-supervised fine-tuning method named multi-task semantic knowledge learning is proposed to strengthen the Transformer's ability to learn semantic knowledge, thereby improving system performance.
Bi-Encoder Transformer Network for Mandarin-English Code-Switching Speech Recognition Using Mixture of Experts
TLDR
This paper studies end-to-end models for Mandarin-English code-switching automatic speech recognition, and proposes a bi-encoder Transformer network based on a Mixture of Experts (MoE) architecture to better leverage these data.
Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition
TLDR
The Conv-Transformer Transducer architecture achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models.
Transformer-Based Long-Context End-to-End Speech Recognition
TLDR
This paper proposes a Transformer-based architecture that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, and investigates how to design the context window and train the model effectively in monologue (one speaker) and dialogue (two speakers) scenarios.
Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition
TLDR
A novel multi-encoder learning method that performs a weighted combination of two encoder-decoder multi-head attention outputs only during training, and achieves state-of-the-art performance for Transformer-based models on WSJ with a significant relative WER reduction of 19% compared to the current benchmark approach.

References

Showing 1-10 of 25 references
Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
  • Linhao Dong, Shuang Xu, Bo Xu
  • Computer Science
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
The Speech-Transformer is presented: a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn positional dependencies and can be trained faster and more efficiently, together with a 2D-attention mechanism that jointly attends to the time and frequency axes of the two-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer.
End-to-end attention-based large vocabulary speech recognition
TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Joint CTC-attention based end-to-end speech recognition using multi-task learning
TLDR
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese
Sequence-to-sequence attention-based models have recently shown very promising results on automatic speech recognition (ASR) tasks, which integrate an acoustic, pronunciation and language model into [...]
Self-attention Networks for Connectionist Temporal Classification in Speech Recognition
TLDR
This work proposes SAN-CTC, a deep, fully self-attentional network for CTC, shows it is tractable and competitive for end-to-end speech recognition, and explores how label alphabets affect attention heads and performance.
End-to-end Speech Recognition With Word-Based Rnn Language Models
TLDR
A novel word-based RNN-LM is proposed, which allows decoding with only the word-based LM by providing look-ahead word probabilities to predict the next characters instead of a character-based LM, leading to competitive accuracy with less computation compared to the multi-level LM.
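LM integration of the kind described here is often realized at decode time by shallow fusion: adding a weighted LM log-probability to the ASR score when ranking hypotheses. A minimal sketch under that assumption; the helper names and the default weight are hypothetical, not from the paper:

```python
def fused_score(asr_logprob: float, lm_logprob: float, lm_weight: float = 0.3) -> float:
    """Shallow fusion: ASR log-probability plus a weighted LM log-probability."""
    return asr_logprob + lm_weight * lm_logprob


def rescore(hypotheses, lm_weight: float = 0.3):
    """Pick the hypothesis with the best fused score.

    `hypotheses` is a list of (text, asr_logprob, lm_logprob) tuples.
    """
    return max(hypotheses, key=lambda h: fused_score(h[1], h[2], lm_weight))
```

A hypothesis with a slightly worse acoustic score but a much better LM score can thus win, which is the point of the integration; the LM weight is tuned on a development set.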
Improved training of end-to-end attention models for speech recognition
TLDR
This work introduces a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.
Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
TLDR
This work learns to listen and write characters with a joint connectionist temporal classification (CTC) and attention-based encoder-decoder network, and beats traditional hybrid ASR systems on spontaneous Japanese and Chinese speech.
Very deep convolutional networks for end-to-end speech recognition
TLDR
This work successively trains very deep convolutional networks to add more expressive power and better generalization to end-to-end ASR models, applying network-in-network principles, batch normalization, residual connections and convolutional LSTMs to build very deep recurrent and convolutional structures.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, being applied successfully to English constituency parsing with both large and limited training data.
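The core operation of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A self-contained plain-Python sketch on toy-sized lists, for illustration only:

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on lists of lists."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

With a query strongly aligned to one key, the attention weights concentrate on that key's value, which is the retrieval behaviour the architecture builds everything on.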