Multi-mode Transformer Transducer with Stochastic Future Context

@inproceedings{Kim2021MultimodeTT,
  title={Multi-mode Transformer Transducer with Stochastic Future Context},
  author={Kwangyoun Kim and Felix Wu and Prashant Sridhar and Kyu J. Han and Shinji Watanabe},
  booktitle={Interspeech},
  year={2021}
}
Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed and accuracy. Naïvely, to fit different latency requirements, people have to store multiple models and pick the best one under the constraints. Instead, a more desirable approach is to have a single model that can dynamically adjust its latency…
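
The abstract's core idea, training one encoder under several right-context sizes drawn at random, can be pictured as mask sampling in self-attention. Below is a minimal sketch, not the authors' code: the function names and the candidate context sizes are illustrative, and the mask is simply a causal mask widened by the sampled right context.

import random
import torch

def future_context_mask(num_frames: int, right_context: int) -> torch.Tensor:
    # True marks key positions a query frame may attend to.
    # right_context = 0 gives fully streaming (causal) attention;
    # right_context < 0 is treated here as unlimited (full-context) attention.
    if right_context < 0:
        return torch.ones(num_frames, num_frames, dtype=torch.bool)
    q = torch.arange(num_frames).unsqueeze(1)  # query frame indices, shape (T, 1)
    k = torch.arange(num_frames).unsqueeze(0)  # key frame indices, shape (1, T)
    return k <= q + right_context

# Illustrative training-time sampling: one latency mode per batch.
MODES = [0, 4, 16, -1]  # hypothetical right-context sizes in frames

def sample_mode_mask(num_frames: int) -> torch.Tensor:
    return future_context_mask(num_frames, random.choice(MODES))

At deployment, the same weights then run in whichever mode satisfies the latency budget, simply by fixing right_context instead of sampling it.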

Citations

CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR

TLDR
Experiments show that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy.

Recent Advances in End-to-End Automatic Speech Recognition

  • Jinyu Li
  • Computer Science
    APSIPA Transactions on Signal and Information Processing
  • 2022
TLDR
This paper overviews the recent advances in E2E models, focusing on technologies that address the challenges of deploying E2E models in industry.

Conformer with dual-mode chunked attention for joint online and offline ASR

TLDR
Results show that the proposed dual-mode system using chunked attention yields 5% and 4% relative WER improvement on the LibriSpeech and medical tasks, compared to the dual-mode system using autoregressive attention with similar average lookahead.
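
The chunked attention mentioned in this entry bounds lookahead by letting each frame see to the end of its own fixed-size chunk. A rough sketch under that assumption (non-overlapping chunks; the function name is illustrative, not from the paper):

import torch

def chunked_attention_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    # Each frame attends to all earlier chunks plus its entire own chunk,
    # so the average lookahead is roughly chunk_size / 2 frames.
    chunk_id = torch.arange(num_frames) // chunk_size
    q = chunk_id.unsqueeze(1)  # chunk index of each query frame
    k = chunk_id.unsqueeze(0)  # chunk index of each key frame
    return k <= q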

References

SHOWING 1-10 OF 32 REFERENCES

Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition

TLDR
This paper presents a Transformer-Transducer model architecture and a training technique that unify streaming and non-streaming speech recognition into one model, and shows that with limited right context and a small additional latency at the end of decoding, the model can achieve accuracy similar to that of models using unlimited audio right context.

Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

TLDR
Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly reduces the emission latency and improves the recognition accuracy of streaming ASR.
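
Dual-mode training can be summarized as two forward passes through shared weights per step, one streaming and one full-context. A hedged sketch, assuming the encoder exposes an attention-mask argument; encoder and transducer_loss are placeholder callables, not the paper's API:

import torch

def dual_mode_loss(encoder, transducer_loss, feats, targets,
                   causal_mask: torch.Tensor, full_mask: torch.Tensor):
    # Same shared-weight encoder, two attention masks.
    enc_stream = encoder(feats, attn_mask=causal_mask)  # streaming mode
    enc_full = encoder(feats, attn_mask=full_mask)      # full-context mode
    return 0.5 * (transducer_loss(enc_stream, targets)
                  + transducer_loss(enc_full, targets))

The paper additionally distills the full-context mode's predictions into the streaming mode; that term is omitted from this sketch.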

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

TLDR
This paper presents the Eesen framework, which drastically simplifies the existing pipeline for building state-of-the-art ASR systems and achieves comparable word error rates (WERs) while speeding up decoding significantly.

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

TLDR
By comparing several methods that leverage text-only data in a new domain, it is found that updating the RNN-T's prediction and joint networks with text-to-speech audio generated from domain-specific text is the most effective.

Joint CTC/attention decoding for end-to-end speech recognition

TLDR
This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively exploits the advantages of both models during decoding.

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

TLDR
The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and exhibits performance comparable to conventional DNN/HMM ASR systems, drawing on the advantages of both multi-objective learning and joint decoding, without requiring linguistic resources.
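
Both this entry and the previous one rest on the same interpolation: a weighted combination of CTC and attention terms, used as a multi-objective loss during training and as a joint score during beam-search decoding. A minimal sketch; the weight name lam and its default value are illustrative:

def hybrid_loss(loss_ctc: float, loss_att: float, lam: float = 0.3) -> float:
    # Multi-objective training: interpolate the CTC and attention losses.
    return lam * loss_ctc + (1.0 - lam) * loss_att

def joint_score(logp_ctc: float, logp_att: float, lam: float = 0.3) -> float:
    # Joint decoding: score each beam hypothesis with both models.
    return lam * logp_ctc + (1.0 - lam) * logp_att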

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

  • Qian Zhang, Han Lu, Shankar Kumar
  • Computer Science, Physics
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
This paper presents an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system, and shows that the full-attention version of the model beats the state-of-the-art accuracy on the LibriSpeech benchmarks.

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

TLDR
The experimental results show that Universal ASR provides an efficient mechanism for integrating streaming and non-streaming modes into a single model that recognizes speech quickly and accurately, and comfortably outperforms other state-of-the-art systems.

Attention is All you Need

TLDR
This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely, and shows that it generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.
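
For reference, the scaled dot-product attention at the core of the Transformer, in a minimal single-head form; masks like those sketched above would be applied to the scores before the softmax:

import math
import torch

def attention(q, k, v, mask=None):
    # q, k, v: (..., T, d) tensors; mask: boolean (T, T), True = may attend.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v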

A Comparison of End-to-End Models for Long-Form Speech Recognition

  • C. Chiu, Wei Han, Yonghui Wu
  • Computer Science
    2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2019
TLDR
This paper investigates and improves the performance of end-to-end models on long-form transcription, exploring two changes to attention-based systems that significantly improve their performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments.
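
The overlapping-segment decoding named in this TLDR starts from a simple split; a sketch with illustrative parameters (merging the per-segment hypotheses, which the paper's algorithm handles, is omitted):

def split_overlapping(frames, seg_len: int = 2400, hop: int = 2000):
    # hop < seg_len yields overlapping segments; the overlap gives each
    # segment some acoustic context from its neighbors.
    return [frames[i:i + seg_len] for i in range(0, len(frames), hop)]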