Tied & Reduced RNN-T Decoder

  • Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He
  • Published 15 September 2021
  • Computer Science, Engineering
  • ArXiv
Previous work on Recurrent Neural Network-Transducer (RNN-T) models has shown that, under some conditions, it is possible to simplify the prediction network with little or no loss in recognition accuracy [1, 2, 3]. This is done by limiting the context size of previous labels and/or by using a simpler architecture for its layers instead of LSTMs. The benefits of such changes include a reduction in model size, faster inference, and power savings, all of which are useful for on-device applications.
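The simplification described above replaces a recurrent prediction network with one that looks at only the last N labels. A minimal sketch, assuming a plain embedding lookup with a mean-merge over the limited context; the class name, blank-padding scheme, and merge function are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

class ReducedPredictionNetwork:
    """Stateless RNN-T prediction network sketch: instead of running an
    LSTM over the full label history, embed only the last N labels and
    merge their embeddings. (Illustrative; not the paper's exact design.)"""

    def __init__(self, vocab_size, embed_dim, context_size=2, seed=0):
        rng = np.random.default_rng(seed)
        self.embeddings = rng.standard_normal((vocab_size, embed_dim))
        self.context_size = context_size
        self.blank_id = 0  # pad short histories with the blank label

    def __call__(self, label_history):
        # Keep only the last N labels; pad short histories with blank.
        ctx = list(label_history[-self.context_size:])
        ctx = [self.blank_id] * (self.context_size - len(ctx)) + ctx
        # Merge the context embeddings (here: a simple mean).
        return self.embeddings[ctx].mean(axis=0)

pred = ReducedPredictionNetwork(vocab_size=10, embed_dim=4)
out = pred([3, 7, 1])  # only the last 2 labels (7, 1) affect the output
```

Because the network is stateless, two histories that share the same last N labels produce identical outputs, which is what enables path merging during decoding.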
1 Citation


An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER) [1], and latency [2].


References

RNN-Transducer with Stateless Prediction Network
The results suggest that the RNN-T prediction network does not function as the LM in classical ASR; instead, it merely helps the model align to the input audio, while the RNN-T encoder and joint networks capture both the acoustic and the linguistic information.
Attention is All you Need
The Transformer, a new simple network architecture based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
Improving RNN Transducer Modeling for End-to-End Speech Recognition
This paper optimizes the training algorithm of RNN-T to reduce memory consumption, allowing larger training minibatches for faster training, and proposes better model structures that yield RNN-T models with very good accuracy but a small footprint.
Less is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
This work finds that the context of the recurrent neural network transducer (RNN-T) can be limited during training to just four previous word-piece labels without degrading word error rate (WER) relative to the full-context baseline.
End-to-end attention-based large vocabulary speech recognition
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition
Experimental results demonstrate that an MBR-trained model substantially outperforms an RNN-T-trained model, and that further improvements can be achieved by training with an external NNLM.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, +11 authors M. Bacchiani
  • Computer Science, Engineering
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
Conformer: Convolution-augmented Transformer for Speech Recognition
This work proposes the convolution-augmented Transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
This work introduces a novel theoretical framework that facilitates better learning in language modeling, and shows that this framework leads to tying together the input embedding and the output projection matrices, greatly reducing the number of trainable variables. Expand
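The weight tying described above can be sketched minimally: the output projection reuses the transposed input embedding matrix rather than learning a separate softmax weight matrix. A toy illustration; the dimensions and function names are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 10, 4
E = rng.standard_normal((vocab_size, hidden))  # input embedding matrix

def embed(token_id):
    # Input side: look up the token's embedding row.
    return E[token_id]

def output_logits(hidden_state):
    # Output side: project with the SAME matrix, transposed, so the
    # embedding and the softmax projection share all their parameters.
    return hidden_state @ E.T

logits = output_logits(embed(3))
```

Tying halves the number of vocabulary-sized parameter matrices, which is why it is attractive for reducing decoder size in the tied RNN-T setting.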
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
  • Qian Zhang, Han Lu, +4 authors Shankar Kumar
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system is presented, and the full-attention version of the model is shown to beat state-of-the-art accuracy on the LibriSpeech benchmarks.