Tied & Reduced RNN-T Decoder

Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He
Previous work on Recurrent Neural Network-Transducer (RNN-T) models has shown that, under some conditions, it is possible to simplify the prediction network with little or no loss in recognition accuracy [1, 2, 3]. This is done by limiting the context size of previous labels and/or using a simpler architecture for its layers instead of LSTMs. The benefits of such changes include a reduction in model size, faster inference, and power savings, all of which are useful for on-device applications…
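The idea of a reduced prediction network with a limited label context can be illustrated with a minimal sketch: instead of running an LSTM over the full label history, embed only the last N labels and combine them into a fixed-size state. All names, sizes, and the averaging combination below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

# Hypothetical sketch of a "reduced" RNN-T prediction network: the decoder
# state depends only on the last N_CONTEXT labels, not the full history.
N_CONTEXT = 2      # number of previous labels used (assumed, for illustration)
VOCAB = 100        # toy vocabulary size
DIM = 8            # toy embedding dimension

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((VOCAB, DIM))

def prediction_network(label_history):
    """Return a fixed-size state computed from at most the last N_CONTEXT labels."""
    context = label_history[-N_CONTEXT:]
    if not context:
        return np.zeros(DIM)              # empty history -> zero state
    # Simple combination: average the context embeddings.
    return embeddings[context].mean(axis=0)

state = prediction_network([5, 17, 42])   # only labels 17 and 42 influence the state
```

Because the state ignores everything before the last N labels, two histories that end the same way produce identical decoder states, which is what enables the path-merging tricks discussed in the cited work.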


An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER) [1], and latency [2], measured by the…
Deliberation of Streaming RNN-Transducer by Non-autoregressive Decoding
  • Weiran Wang, Ke Hu, T. Sainath
  • Computer Science, Engineering
  • 2021
It is shown that the proposed Align-Refine non-autoregressive decoding method obtains significantly more accurate recognition results than the first-pass RNN-T, with only a small number of model parameters.


RNN-Transducer with Stateless Prediction Network
The results suggest that the RNN-T prediction network does not function as the LM in classical ASR; instead, it merely helps the model align to the input audio, while the RNN-T encoder and joint networks capture both the acoustic and the linguistic information.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Improving RNN Transducer Modeling for End-to-End Speech Recognition
  • Jinyu Li, Rui Zhao, Hu Hu, Y. Gong
  • Computer Science, Engineering
    2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2019
This paper optimizes the training algorithm of RNN-T to reduce memory consumption, allowing larger training minibatches for faster training, and proposes better model structures so that RNN-T models with very good accuracy but a small footprint are obtained.
Less is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging
This work finds that it can limit the context of the recurrent neural network transducer (RNN-T) during training to just four previous word-piece labels, without degrading word error rate (WER) relative to the full-context baseline.
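Limiting the prediction-network context also enables path merging during beam search: hypotheses whose last N labels agree are scored identically going forward, so they can be collapsed. The sketch below is a hypothetical illustration of that merging step; the function name, data layout, and probability-summing rule are assumptions for clarity, not the cited paper's implementation.

```python
# Hypothetical path-merging step for beam search with a limited-context decoder.
N_CONTEXT = 4  # context size from the cited work: four previous word-piece labels

def merge_paths(hypotheses):
    """hypotheses: list of (label_sequence, probability) tuples.

    Hypotheses sharing the same last-N_CONTEXT labels are merged, since the
    limited-context model assigns them identical future scores.
    """
    merged = {}
    for labels, prob in hypotheses:
        key = tuple(labels[-N_CONTEXT:])          # decoder state ~ last N labels
        if key in merged:
            kept_labels, kept_prob = merged[key]
            merged[key] = (kept_labels, kept_prob + prob)  # combine probabilities
        else:
            merged[key] = (labels, prob)
    return list(merged.values())

# Both paths end in (2, 3, 4, 5), so they collapse into one hypothesis.
paths = [([1, 2, 3, 4, 5], 0.4), ([9, 2, 3, 4, 5], 0.3)]
merged = merge_paths(paths)
```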
End-to-end attention-based large vocabulary speech recognition
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition
Experimental results demonstrate an MBR trained model outperforms a RNN-T trained model substantially and further improvements can be achieved if trained with an external NNLM.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, +11 authors M. Bacchiani
  • Computer Science, Engineering
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly-used single-head attention.
Conformer: Convolution-augmented Transformer for Speech Recognition
This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies.
Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling
This work introduces a novel theoretical framework that facilitates better learning in language modeling, and shows that this framework leads to tying together the input embedding and the output projection matrices, greatly reducing the number of trainable variables.
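The weight-tying idea from this cited work can be shown in a few lines: the same matrix serves as the input embedding table and, transposed, as the output projection, so those parameters are shared rather than learned twice. This is a minimal sketch with illustrative sizes, not the paper's full framework (which also motivates the tying theoretically).

```python
import numpy as np

# Minimal sketch of input/output weight tying in a language model:
# one shared matrix E is used for both embedding lookup and output logits.
VOCAB, DIM = 50, 16                      # toy sizes, for illustration only
rng = np.random.default_rng(1)
E = rng.standard_normal((VOCAB, DIM))    # shared embedding / projection matrix

def logits_for(token_id, hidden_transform=lambda h: h):
    """Embed a token, apply some hidden transform, project back to the vocab."""
    h = hidden_transform(E[token_id])    # input side: embedding lookup
    return h @ E.T                       # output side: tied (transposed) projection

scores = logits_for(7)                   # one logit per vocabulary entry
```

Tying removes the separate output projection matrix, cutting those VOCAB × DIM parameters, which is the same trick the Tied & Reduced decoder applies to the RNN-T prediction network.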
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
  • Qian Zhang, Han Lu, +4 authors Shankar Kumar
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system; the full-attention version of the model achieves state-of-the-art accuracy on the LibriSpeech benchmarks.