Corpus ID: 239050535

An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

@article{Zhao2021AnIO,
  title={An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR},
  author={Huaibo Zhao and Yosuke Higuchi and Tetsuji Ogawa and Tetsunori Kobayashi},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.10402}
}
In the present paper, an attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system that provides high performance with low latency. The triggered attention mechanism, which performs autoregressive decoding triggered by CTC spikes, has been shown to be effective in streaming ASR. However, in order to maintain high accuracy of alignment estimation based on CTC outputs, which is the key to its performance…
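The CTC spikes that serve as decoding triggers can be illustrated with a minimal sketch (not the authors' implementation): greedy CTC decoding over per-frame label posteriors, recording the frame index at which each new non-blank token is emitted. A triggered-attention decoder would then restrict attention to encoder frames up to each trigger plus a small look-ahead. The blank index and the toy posteriors below are assumptions for illustration.

```python
import numpy as np

BLANK = 0  # conventional CTC blank index (an assumption here)

def ctc_trigger_frames(posteriors: np.ndarray) -> list:
    """Return (frame_index, token_id) pairs for CTC spikes.

    posteriors: (T, V) array of per-frame label probabilities.
    A spike is the first frame of each new non-blank label after
    collapsing repeats, i.e. where greedy CTC emits a token.
    """
    best = posteriors.argmax(axis=1)  # greedy per-frame labels
    triggers = []
    prev = BLANK
    for t, label in enumerate(best):
        if label != BLANK and label != prev:
            triggers.append((t, int(label)))
        prev = label
    return triggers

# Toy example: 6 frames, vocabulary {0: blank, 1: 'a', 2: 'b'}
probs = np.array([
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8,  0.1 ],  # 'a' spike -> trigger at frame 1
    [0.1, 0.8,  0.1 ],  # repeated 'a', no new trigger
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1,  0.8 ],  # 'b' spike -> trigger at frame 4
    [0.9, 0.05, 0.05],  # blank
])
print(ctc_trigger_frames(probs))  # [(1, 1), (4, 2)]
```

The quality of these estimated trigger positions is exactly what the paper's enhanced CTC model aims to improve, since misplaced spikes shift the attention window the decoder is allowed to see.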


References

SHOWING 1-10 OF 31 REFERENCES
Triggered Attention for End-to-end Speech Recognition
TLDR: The proposed triggered attention (TA) decoder concept achieves similar or better ASR results in all experiments compared to the full-sequence attention model, while also limiting the decoding delay to two look-ahead frames, which in this setup corresponds to an output delay of 80 ms.
Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
TLDR: This work proposes several training strategies that leverage external hard alignments extracted from the hybrid model to reduce latency, and investigates utilizing the alignments in both the encoder and the decoder.
End-to-End ASR with Adaptive Span Self-Attention
TLDR: The method enables the network to learn an appropriate size and position of the window for each layer and head, and the newly introduced scheme can further control the window size depending on the future and past contexts to save both computational complexity and memory size.
Attention-Based Models for Speech Recognition
TLDR: The attention mechanism is extended with features needed for speech recognition, and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
Improving the Performance of Online Neural Transducer Models
  • T. Sainath, C. Chiu, +4 authors Zhifeng Chen
  • Computer Science, Engineering
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR: Improvements to the neural transducer (NT) are presented, including increasing the window over which NT computes attention, mainly by looking backwards in time so the model still remains online.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, +11 authors M. Bacchiani
  • Computer Science, Engineering
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR: A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration
TLDR: This work integrates connectionist temporal classification (CTC) with Transformer for joint training and decoding of automatic speech recognition (ASR) tasks, which makes training faster than with RNNs and assists LM integration.
Monotonic Chunkwise Attention
TLDR: Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed, is proposed, and it is shown that models utilizing MoChA can be trained efficiently with standard backpropagation while allowing online and linear-time decoding at test time.
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition
TLDR: A new unidirectional neural network architecture of parallel time-delayed LSTM (PTDLSTM) streams is proposed, which limits the processing latency to 250 ms and shows significant improvements compared to prior art on a variety of ASR tasks.