An Asynchronous WFST-Based Decoder for Automatic Speech Recognition

@article{Lv2021AnAW,
  title={An Asynchronous WFST-Based Decoder for Automatic Speech Recognition},
  author={Hang Lv and Zhehuai Chen and Hainan Xu and Daniel Povey and Lei Xie and Sanjeev Khudanpur},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={6019-6023}
}
We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in one-pass decoding for large-vocabulary continuous speech recognition. Unlike a standard one-pass decoder with on-the-fly composition, which can incur significant computation overhead, the asynchronous dynamic decoder has a novel design with two fronts: one performs "exploration" and the other "backfill". The computation of the two fronts…
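The two-front design the abstract describes can be pictured with a toy best-first search: an "exploration" front expands hypotheses cheaply using a small-LM estimate, while a "backfill" front re-scores a hypothesis with the big LM before it is allowed to finish. The Python sketch below illustrates only that idea; `decode`, `small_lm_cost`, and `big_lm_cost` are invented names, the costs are placeholders, and the actual decoder operates on WFSTs rather than on this toy dictionary lattice.

```python
import heapq

# Toy sketch of the exploration/backfill idea (not the paper's implementation).

def small_lm_cost(history, word):
    # Cheap stand-in for a small n-gram LM lookup used during exploration.
    return 1.0

def big_lm_cost(history, word):
    # Expensive stand-in for the big LM; here it just penalizes repeated words.
    return 2.0 if history and history[-1] == word else 0.5

def decode(lattice, start, finals):
    """Best-first search: the exploration front expands hypotheses with the
    cheap small-LM cost; the backfill front re-scores a hypothesis with the
    big LM before it may be accepted as final."""
    # Queue items: (priority, cost, state, word_history, backfilled_flag)
    queue = [(0.0, 0.0, start, (), False)]
    while queue:
        _, cost, state, hist, backfilled = heapq.heappop(queue)
        if state in finals:
            if backfilled:
                return hist, cost  # best fully big-LM-scored hypothesis
            # Backfill front: replace the small-LM estimate with big-LM scores.
            rescored = sum(big_lm_cost(hist[:i], w) for i, w in enumerate(hist))
            heapq.heappush(queue, (rescored, rescored, state, hist, True))
            continue
        # Exploration front: expand outgoing arcs with the cheap estimate.
        for word, nxt, ac_cost in lattice.get(state, []):
            new_cost = cost + ac_cost + small_lm_cost(hist, word)
            heapq.heappush(queue, (new_cost, new_cost, nxt, hist + (word,), False))
    return None, float("inf")

# Tiny toy lattice: state -> [(word, next_state, acoustic_cost), ...]
lattice = {0: [("a", 1, 0.1), ("b", 1, 0.2)], 1: [("c", 2, 0.1)]}
print(decode(lattice, start=0, finals={2}))  # -> (('a', 'c'), 1.0)
```

The point of splitting the work this way is that the expensive big-LM lookups are deferred until a hypothesis is actually promising, so exploration stays fast while final scores still reflect the big model.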

