Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration
@inproceedings{Karita2019ImprovingTE,
  title     = {Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration},
  author    = {Shigeki Karita and Nelson Yalta and Shinji Watanabe and Marc Delcroix and Atsunori Ogawa and Tomohiro Nakatani},
  booktitle = {Interspeech},
  year      = {2019}
}
The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. In our experiments, we found that training of the Transformer is slower than that of an RNN as regards the learning curve, and that integration with a naive language model (LM) is difficult. To realize a faster and more accurate ASR system, we combine the Transformer with advances from RNN-based ASR.
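As a rough illustration of the joint CTC/attention objective the abstract refers to, here is a minimal pure-Python sketch: a CTC forward pass over a toy two-frame example, combined with an interpolated multi-task loss. The weight value (0.3) and the dummy attention loss are illustrative assumptions, not values taken from the paper.

```python
import math

BLANK = 0  # conventionally, token id 0 is the CTC blank

def ctc_prob(frame_probs, labels):
    """Forward algorithm: total probability of `labels` under CTC,
    summing over all frame-level alignments with blanks and repeats.
    frame_probs: per-frame distributions over the vocabulary.
    labels: target token ids (no blanks)."""
    ext = [BLANK]
    for y in labels:
        ext += [y, BLANK]          # interleave blanks: b, y1, b, y2, ..., b
    S, T = len(ext), len(frame_probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = frame_probs[0][BLANK]
    if S > 1:
        alpha[0][1] = frame_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            # skip transition allowed unless current symbol is blank
            # or repeats the label two positions back
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * frame_probs[t][ext[s]]
    # valid final states: last blank or last label
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

def joint_loss(p_ctc, att_loss, lam=0.3):
    """Multi-task objective: lam * CTC loss + (1 - lam) * attention loss."""
    return lam * -math.log(p_ctc) + (1.0 - lam) * att_loss

# Toy vocab {0: blank, 1: 'a'}; two frames with uniform posteriors.
probs = [[0.5, 0.5], [0.5, 0.5]]
p = ctc_prob(probs, [1])   # alignments (b,a), (a,b), (a,a) -> 0.75
loss = joint_loss(p, att_loss=1.2, lam=0.3)
```

In practice the CTC branch is computed on the encoder output and the attention loss on the decoder output, and the same interpolation can also be reused for joint decoding.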
151 Citations
Streaming Automatic Speech Recognition with the Transformer Model
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work proposes a transformer-based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word, applying time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism.
A Comparative Study on Transformer vs RNN in Speech Applications
- Computer Science · 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
An emergent sequence-to-sequence model called Transformer achieves state-of-the-art performance in neural machine translation and other natural language processing applications; the study finds a surprising superiority of Transformer over RNN in 13 of 15 ASR benchmarks.
A study of transformer-based end-to-end speech recognition system for Kazakh language
- Computer Science · Scientific Reports
- 2022
It was revealed that the joint use of Transformer and connectionist temporal classification models contributed to improving the performance of the Kazakh speech recognition system; with an integrated language model, it achieved the best character error rate of 3.7% on a clean dataset.
Simplified Self-Attention for Transformer-Based end-to-end Speech Recognition
- Computer Science · 2021 IEEE Spoken Language Technology Workshop (SLT)
- 2021
A simplified self-attention network (SSAN) based Transformer model is proposed, which employs FSMN memory blocks instead of projection layers to form query and key vectors for transformer-based end-to-end speech recognition, showing no loss of recognition performance on 20,000-hour large-scale Mandarin tasks.
Cross Attention with Monotonic Alignment for Speech Transformer
- Computer Science · INTERSPEECH
- 2020
This paper presents an effective cross-attention biasing technique for the Transformer that takes the monotonic alignment between text output and speech input into consideration by making use of cross-attention weights, and introduces a regularizer for alignment regularization.
Attention-Based ASR with Lightweight and Dynamic Convolutions
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper proposes to apply lightweight and dynamic convolution to E2E ASR as an alternative architecture to self-attention in order to make the computational order linear, and proposes joint training with connectionist temporal classification, convolution on the frequency axis, and combination with self-attention.
End-To-End Multi-Speaker Speech Recognition With Transformer
- Computer Science, Physics · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture and incorporates an external dereverberation preprocessing step, the weighted prediction error (WPE), enabling the model to handle reverberated signals.
Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
- Computer Science · Sensors
- 2022
To improve the performance of an E2E agglutinative-language speech recognition system, a new feature extractor, MSPC, is proposed, which uses convolution kernels of different sizes to extract and fuse features at different scales and is superior to VGGnet.
Hyperparameter experiments on end-to-end automatic speech recognition
- Computer Science · Phonetics and Speech Sciences
- 2021
This paper investigates the impact of hyperparameters in the Transformer network to answer two questions: which hyperparameters play a critical role in task performance and training speed, and which hyperparameters are altered in the encoder and decoder networks.
Systems for Low-Resource Speech Recognition Tasks in Open Automatic Speech Recognition and Formosa Speech Recognition Challenges
- Computer Science · Interspeech
- 2021
A speaker classifier with a gradient reversal layer is included in the training phase to improve robustness to speaker variation, and end-to-end (E2E) systems are built and compared against Deep Neural Network Hidden Markov Model (DNN-HMM) systems.
References
Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
The Speech-Transformer is presented, a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn positional dependencies and can be trained faster and more efficiently, together with a 2D-Attention mechanism that can jointly attend to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer.
End-to-end attention-based large vocabulary speech recognition
- Computer Science · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Joint CTC-attention based end-to-end speech recognition using multi-task learning
- Computer Science · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese
- Computer Science · INTERSPEECH
- 2018
Sequence-to-sequence attention-based models have recently shown very promising results on automatic speech recognition (ASR) tasks, which integrate an acoustic, pronunciation and language model into…
Self-attention Networks for Connectionist Temporal Classification in Speech Recognition
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This work proposes SAN-CTC, a deep, fully self-attentional network for CTC, and shows it is tractable and competitive for end-to-end speech recognition, and explores how label alphabets affect attention heads and performance.
End-to-end Speech Recognition With Word-Based Rnn Language Models
- Computer Science · 2018 IEEE Spoken Language Technology Workshop (SLT)
- 2018
A novel word-based RNN-LM is proposed, which allows decoding with only the word-based LM, where it provides look-ahead word probabilities to predict the next characters instead of a character-based LM, leading to competitive accuracy with less computation compared to the multi-level LM.
Improved training of end-to-end attention models for speech recognition
- Computer Science · INTERSPEECH
- 2018
This work introduces a new pretraining scheme that starts with a high time-reduction factor and lowers it during training, which is crucial both for convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.
Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
- Computer Science · INTERSPEECH
- 2017
This work learns to listen and write characters with a joint connectionist temporal classification (CTC) and attention-based encoder-decoder network, and beats traditional hybrid ASR systems on spontaneous Japanese and Chinese speech.
Very deep convolutional networks for end-to-end speech recognition
- Computer Science · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
This work successively trains very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models, applying network-in-network principles, batch normalization, residual connections, and convolutional LSTMs to build very deep recurrent and convolutional structures.
Attention is All you Need
- Computer Science · NIPS
- 2017
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, having been applied successfully to English constituency parsing with both large and limited training data.