Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation
@article{Braun2022NeuralTT,
  title   = {Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation},
  author  = {Stefan Braun and Erik McDermott and Roger Hsiao},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2211.16270}
}
The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and…
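To see why transducer training is memory-hungry, note that the loss is computed over a T×U alignment lattice per utterance (T encoder frames, U target labels), and batched implementations pad every sample to (T_max, U_max). The sketch below is not the authors' method, only a minimal pure-Python illustration of the standard RNN-T forward algorithm (Graves, 2012) for a single sample; the function name `rnnt_loss_single` and the toy inputs are invented for this example.

```python
import math

def logsumexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_loss_single(blank_lp, label_lp, T, U):
    """Negative log-likelihood of one utterance via the RNN-T forward
    algorithm. blank_lp[t][u] / label_lp[t][u] are log-probabilities of
    emitting blank / the next label at lattice node (t, u).
    Memory is O(T * U) for this one sample; a batched implementation pads
    every sample to (T_max, U_max), which is where memory blows up."""
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            from_blank = alpha[t - 1][u] + blank_lp[t - 1][u] if t > 0 else -math.inf
            from_label = alpha[t][u - 1] + label_lp[t][u - 1] if u > 0 else -math.inf
            alpha[t][u] = logsumexp(from_blank, from_label)
    # Terminate with the final blank that consumes the last frame.
    return -(alpha[T - 1][U] + blank_lp[T - 1][U])

# Toy check: uniform transition probability 0.5, T = U = 2.
# There are C(T-1+U, U) = 3 monotonic alignments, each with probability
# 0.5**(T+U), so the loss should be -log(3/16).
T, U = 2, 2
lp = math.log(0.5)
uniform = [[lp] * (U + 1) for _ in range(T)]
loss = rnnt_loss_single(uniform, uniform, T, U)
print(round(loss, 6))  # -log(3/16) ≈ 1.673976
```

A sample-wise scheme in the spirit of the abstract would run such a per-utterance computation (and its backward pass) one sample at a time and accumulate gradients, instead of materializing the padded batch-level joint tensor.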
References
Pruned RNN-T for fast, memory-efficient ASR training
- INTERSPEECH
- 2022
A faster, more memory-efficient RNN-T loss computation is introduced, based on a simple joiner network that is linear in the encoder and decoder embeddings; this method is shown to require little memory to evaluate.
Improving RNN Transducer Modeling for End-to-End Speech Recognition
- 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
This paper optimizes the RNN-T training algorithm to reduce memory consumption, allowing larger training minibatches for faster training, and proposes better model structures that yield RNN-T models with very good accuracy and a small footprint.
Efficient Implementation of Recurrent Neural Network Transducer in Tensorflow
- 2018 IEEE Spoken Language Technology Workshop (SLT)
- 2018
An efficient implementation of the RNN-T forward-backward and Viterbi algorithms using standard matrix operations is presented, which allows us to easily implement the algorithm in TensorFlow by making use of the existing hardware-accelerated implementations of these operations.
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system; the full-attention version of the model beats state-of-the-art accuracy on the LibriSpeech benchmarks.
Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer
- 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2017
This work investigates training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T) and finds that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors.
Exploring neural transducers for end-to-end speech recognition
- 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2017
It is shown that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model on the popular Hub5'00 benchmark.
Benchmarking LF-MMI, CTC And RNN-T Criteria For Streaming ASR
- 2021 IEEE Spoken Language Technology Workshop (SLT)
- 2021
This work performs comprehensive evaluations of three popular training criteria — LF-MMI, CTC, and RNN-T — using identical datasets and encoder model architecture, presenting the first comprehensive benchmark of these three widely used criteria across many languages.
Attention-Based Models for Speech Recognition
- NIPS
- 2015
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate high phoneme error rates.
Streaming End-to-end Speech Recognition for Mobile Devices
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This work describes efforts to build an E2E speech recognizer using a recurrent neural network transducer, finding that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy.
Torchaudio: Building Blocks for Audio and Speech Processing
- ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
An overview of the design principles, functionalities, and benchmarks of TorchAudio is provided, and implementations of several audio and speech operations and models are benchmarked.