Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation

Stefan Braun, Erik McDermott, Roger Hsiao
The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and… 
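The memory pressure comes from the transducer loss lattice itself: the standard forward recursion (Graves, 2012) runs over a T × (U+1) grid fed by a T × (U+1) × V joint tensor, so memory grows with batch size, input length T, target length U, and vocabulary size V. A minimal NumPy sketch of that baseline per-sample recursion (an illustration of the standard computation, not of the method proposed in this paper; names are illustrative):

```python
import numpy as np

def rnnt_forward_loss(log_probs, targets, blank=0):
    """Negative log-likelihood of one sample under the RNN-T model.

    log_probs : (T, U+1, V) joint log-probabilities for T encoder frames,
                U target labels and vocabulary size V.
    targets   : sequence of U label indices.
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1
    alpha = np.full((T, U + 1), -np.inf)  # forward variables over the lattice
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            cands = []
            if t > 0:  # arrive by emitting blank at node (t-1, u)
                cands.append(alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrive by emitting label y[u-1] at node (t, u-1)
                cands.append(alpha[t, u - 1] + log_probs[t, u - 1, targets[u - 1]])
            alpha[t, u] = np.logaddexp.reduce(cands)
    # terminate with a final blank from the top-right lattice node
    return -(alpha[T - 1, U] + log_probs[T - 1, U, blank])
```

With uniform log-probabilities over a vocabulary of size V, the loss reduces to counting the monotonic paths through the lattice, which makes the recursion easy to sanity-check by hand.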


Pruned RNN-T for fast, memory-efficient ASR training

A method for faster and more memory-efficient RNN-T loss computation is introduced, based on a simple joiner network that is linear in the encoder and decoder embeddings; this method is shown to be evaluable with little memory.

Improving RNN Transducer Modeling for End-to-End Speech Recognition

This paper optimizes the RNN-T training algorithm to reduce memory consumption, enabling larger training minibatches for faster training, and proposes improved model structures that yield RNN-T models with very good accuracy but a small footprint.

Efficient Implementation of Recurrent Neural Network Transducer in Tensorflow

An efficient implementation of the RNN-T forward-backward and Viterbi algorithms using standard matrix operations is presented, allowing the algorithm to be implemented easily in TensorFlow by reusing existing hardware-accelerated implementations of those operations.
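The key observation behind such implementations is that every lattice node (t, u) with t + u = d depends only on nodes on the previous anti-diagonal d - 1, so each diagonal can be computed as one batch of element-wise array operations rather than a scalar double loop. A hedged NumPy sketch of this wavefront idea (a generic illustration, not the paper's exact TensorFlow implementation; function and variable names are mine):

```python
import numpy as np

def rnnt_forward_loss_diag(log_probs, targets, blank=0):
    """RNN-T forward recursion vectorized over lattice anti-diagonals.

    log_probs : (T, U+1, V) joint log-probabilities; targets : U labels.
    Each anti-diagonal t + u = d is updated with vectorized array ops.
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1
    y = np.asarray(targets)
    blk = log_probs[:, :, blank]  # blank log-probs, shape (T, U+1)
    # label log-probs lbl[t, u] = log_probs[t, u, y[u]], shape (T, U)
    lbl = log_probs[np.arange(T)[:, None], np.arange(U)[None, :], y[None, :]]
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for d in range(1, T + U):
        t = np.arange(max(0, d - U), min(T, d + 1))  # nodes with t + u = d
        u = d - t
        # arrive via blank from (t-1, u); masked to -inf at the t = 0 edge
        from_blank = np.where(t > 0, alpha[t - 1, u] + blk[t - 1, u], -np.inf)
        # arrive via label y[u-1] from (t, u-1); masked at the u = 0 edge
        from_label = np.where(u > 0, alpha[t, u - 1] + lbl[t, u - 1], -np.inf)
        alpha[t, u] = np.logaddexp(from_blank, from_label)
    return -(alpha[T - 1, U] + blk[T - 1, U])
```

The scalar and diagonal versions compute the same forward variables; the diagonal form simply exposes the parallelism that frameworks like TensorFlow can map onto hardware-accelerated kernels.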

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

  • Qian Zhang, Han Lu, Shankar Kumar
  • Computer Science, Physics
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system is presented; the full-attention version of the model is shown to beat state-of-the-art accuracy on the LibriSpeech benchmarks.

Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer

This work investigates training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T) and finds that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors.

Exploring neural transducers for end-to-end speech recognition

It is shown that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model on the popular Hub5'00 benchmark.

Benchmarking LF-MMI, CTC And RNN-T Criteria For Streaming ASR

This work evaluates three popular training criteria (LF-MMI, CTC, and RNN-T) using identical datasets and encoder model architectures, presenting the first comprehensive benchmark of these widely used criteria across many languages.

Attention-Based Models for Speech Recognition

The attention mechanism is extended with features needed for speech recognition, and a novel, generic method for adding location-awareness to the attention mechanism is proposed to alleviate high phoneme error rates.

Streaming End-to-end Speech Recognition for Mobile Devices

This work describes efforts at building an E2E speech recognizer using a recurrent neural network transducer and finds that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy.

Torchaudio: Building Blocks for Audio and Speech Processing

  • Yao-Yuan Yang, Moto Hira, Yangyang Shi
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
An overview of the design principles, functionalities, and benchmarks of TorchAudio is provided, and implementations of several audio and speech operations and models are benchmarked.