Less is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging

  • Rohit Prabhavalkar, Yanzhang He, David Rybach, Sean Campbell, Arun Narayanan, Trevor Strohman, Tara N. Sainath
  • Published 12 December 2020
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

End-to-end models that condition the output sequence on all previously predicted labels have emerged as popular alternatives to conventional systems for automatic speech recognition (ASR). Since distinct label histories correspond to distinct model states, such models are decoded using an approximate beam search which produces a tree of hypotheses. In this work, we study the influence of the amount of label context on the model's accuracy, and its impact on the efficiency of the decoding…
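The path merging in the title can be sketched as follows: with a limited-context prediction network, two beam-search hypotheses whose label histories agree on the last N labels reach the same decoder state, so their paths can be merged (scores combined) instead of kept as separate branches of the hypothesis tree. The function below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the hypothesis representation and score combination are assumptions.

```python
import math

def merge_paths(hyps, context_size):
    """Merge hypotheses whose last `context_size` labels match.

    Each hypothesis is a (label_sequence, log_prob) pair. Hypotheses with
    identical truncated label histories are assumed to share a decoder
    state, so their probabilities are summed (in log space) and the
    higher-scoring label sequence is kept as the representative.
    """
    merged = {}
    for labels, logp in hyps:
        key = tuple(labels[-context_size:])
        if key in merged:
            best_labels, best_logp = merged[key]
            combined = math.log(math.exp(best_logp) + math.exp(logp))
            # Keep the better-scoring full label sequence as representative.
            rep = labels if logp > best_logp else best_labels
            merged[key] = (rep, combined)
        else:
            merged[key] = (labels, logp)
    return list(merged.values())

# Toy example: two paths that differ only outside the last-2-label window
# get merged; the third path survives on its own.
hyps = [(["a", "b", "c"], math.log(0.4)),
        (["x", "b", "c"], math.log(0.3)),
        (["a", "b", "d"], math.log(0.2))]
merged = merge_paths(hyps, context_size=2)
```

Merging this way means the beam explores a lattice rather than a tree, so a fixed beam budget covers more genuinely distinct continuations.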


Tied & Reduced RNN-T Decoder
This work studies ways to make the RNN-T decoder (prediction network + joint network) smaller and faster without degrading recognition performance: it performs a simple weighted averaging of the input embeddings and shares its embedding matrix weights with the joint network's output layer.
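The embedding sharing mentioned above (reusing the decoder's input embedding matrix as the joint network's output projection) can be illustrated with a minimal NumPy sketch. The shapes and function names are illustrative assumptions, not the paper's code; the point is that one parameter matrix serves both roles.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 16, 8

# One shared parameter matrix: its rows are the label embeddings, and the
# same matrix (transposed) projects joint-network features to vocab logits.
E = rng.normal(size=(vocab_size, embed_dim))

def embed(label_ids):
    # Embedding lookup: (batch,) -> (batch, embed_dim)
    return E[label_ids]

def output_logits(joint_features):
    # Tied output layer: (batch, embed_dim) -> (batch, vocab_size)
    return joint_features @ E.T

labels = np.array([3, 7])
feats = embed(labels)            # stand-in for joint-network features
logits = output_logits(feats)
```

Tying the two matrices removes one of the decoder's largest parameter blocks, which is the main source of the size reduction the summary describes.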
Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems
This work proposes a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities, and applies the technique to an E2E ASR system that transcribes doctor-patient conversations, better adapting the system to the names in those conversations.
Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition
Commonly used automatic speech recognition (ASR) systems can be classified into frame-synchronous and label-synchronous categories, based on whether the speech is decoded on a per-frame or per-label basis.
Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition
  • Zhiyun Lu, Yanwei Pan, +4 authors Trevor Strohman
  • Computer Science, Engineering
  • ArXiv
  • 2021
This paper presents an empirical study of the effect of training utterance length on the word error rate (WER) of the RNN-transducer (RNN-T) model, and shows that for both losses, the WER on long-form speech decreases substantially as the training utterance length increases.
Recent Advances in End-to-End Automatic Speech Recognition
  • Jinyu Li
  • Computer Science, Engineering
  • ArXiv
  • 2021
This paper overviews the recent advances in E2E models, focusing on technologies that address those challenges from the industry's perspective.
Multitask Training with Text Data for End-to-End Speech Recognition
A multitask training method for attention-based end-to-end speech recognition models is proposed to better incorporate language-level information; it is comparable to language model shallow fusion without requiring an additional neural network during decoding.


Recognizing Long-Form Speech Using Streaming End-to-End Models
This work examines the ability of E2E models to generalize to unseen domains, and proposes two complementary solutions: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training on short utterances.
Efficient lattice rescoring using recurrent neural network language models
Two novel lattice rescoring methods for RNNLMs are investigated: one uses an n-gram style clustering of history contexts, and the other exploits a distance measure between hidden history vectors.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, +11 authors M. Bacchiani
  • Computer Science, Engineering
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer
This work investigates training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T) and finds that performance can be improved further through the use of sub-word units ('wordpieces'), which capture longer context and significantly reduce substitution errors.
Lattice Generation in Attention-Based Speech Recognition Models
A convolutional architecture is proposed which facilitates comparing states of the model at different positions; it obtains lower word error rates with smaller beam sizes than an otherwise similar architecture with regular beam search.
A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency
  • T. Sainath, Yanzhang He, +26 authors Ding Zhao
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer are developed that surpass a conventional model in both quality and latency, and RNN-T+LAS is found to offer a better WER and latency tradeoff than the conventional model.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, having been applied successfully to English constituency parsing with both large and limited training data.
Towards Better Decoding and Language Model Integration in Sequence to Sequence Models
An attention-based seq2seq speech recognition system that directly transcribes recordings into characters is analysed, observing two shortcomings: overconfidence in its predictions and a tendency to produce incomplete transcriptions when language models are used.
Transformer-Transducer: End-to-End Speech Recognition with Self-Attention
The proposed Transformer-Transducer outperforms a neural transducer with LSTM/BLSTM networks, achieving word error rates of 6.37% on the test-clean set and 15.30% on the test-other set, while remaining streamable, compact, and computationally efficient with complexity O(T), where T is the input sequence length.
Monotonic Recurrent Neural Network Transducer and Decoding Strategies
This work introduces a monotonic version of the RNN-T loss that can be used with the forward-backward algorithm to learn strictly monotonic alignments between the sequences, and shows that breadth-first search is effective in exploring and combining alternative alignments.