End-to-end Speech Recognition Using Lattice-free MMI

@inproceedings{Hadian2018EndtoendSR,
  title={End-to-end Speech Recognition Using Lattice-free MMI},
  author={Hossein Hadian and H. Sameti and Daniel Povey and S. Khudanpur},
  booktitle={INTERSPEECH},
  year={2018}
}
We present our work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models. [...] We also compare with other end-to-end methods such as CTC in character-based and lexicon-free settings and show 5 to 25 percent relative reduction in word error rates on different large vocabulary tasks while using significantly smaller models.
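As background for the LF-MMI objective the abstract refers to, the standard MMI criterion can be sketched as follows (a generic formulation with illustrative notation, not reproduced from the paper itself):

```latex
% MMI objective summed over training utterances u,
% with acoustic scale \kappa and language-model prior P(w):
\mathcal{F}_{\mathrm{MMI}}(\theta) = \sum_{u} \log
\frac{p_{\theta}(\mathbf{X}_u \mid \mathcal{M}_{w_u})^{\kappa}\, P(w_u)}
     {\sum_{w} p_{\theta}(\mathbf{X}_u \mid \mathcal{M}_{w})^{\kappa}\, P(w)}
```

Here \(\mathbf{X}_u\) are the acoustic features of utterance \(u\), \(w_u\) its reference transcript, and \(\mathcal{M}_w\) the HMM composed for word sequence \(w\). In the lattice-free variant, the denominator sum is computed exactly over a phone-level denominator graph rather than approximated with word lattices from a first pass.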
An Investigation of Multilingual ASR Using End-to-end LF-MMI
End-to-end LF-MMI is indeed competitive on a low-resourced multilingual task, comfortably outperforming a connectionist temporal classification baseline; the paper also investigates the feasibility of biphone contexts, concluding that although biphones carry language variability, they are promising for multilingual ASR.
End-to-End Speech Recognition: A review for the French Language
This paper proposes a review of the existing end-to-end ASR approaches for the French language, comparing results to conventional state-of-the-art ASR systems and discussing which units are better suited to model the French language.
Exploring Model Units and Training Strategies for End-to-End Speech Recognition
It is shown that the wordpiece unit outperforms the character unit for all end-to-end systems on the Switchboard Hub5'00 benchmark, and a multi-stage pretraining strategy is proposed, which gives 25.0% and 18. [...]
Jasper: An End-to-End Convolutional Neural Acoustic Model
This paper reports state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data and introduces a new layer-wise optimizer, NovoGrad, to improve training.
Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models
This paper performs discriminative adaptation using lattices obtained from a first-pass decoding, an approach that can be readily integrated into the lattice-free maximum mutual information (LF-MMI) framework.
Towards Using Context-Dependent Symbols in CTC Without State-Tying Decision Trees
A CD symbol embedding network is trained together with the rest of the acoustic model, removing one of the last cases in which neural systems have to be bootstrapped from GMM-HMM ones.
On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data
This work investigates semi-supervised training of acoustic models (AM) with the lattice-free maximum mutual information (LF-MMI) objective in practically relevant scenarios with a limited amount of [...]
Effective Training End-to-End ASR systems for Low-resource Lhasa Dialect of Tibetan Language
This paper focuses on training end-to-end ASR systems for the Lhasa dialect using transformer-based models, investigates effective initialization strategies, and introduces highly compressed and reliable sub-character units for acoustic modeling which have never been used before.
Improved Training Strategies for End-to-End Speech Recognition in Digital Voice Assistants
A novel discriminative initialization strategy is proposed, introducing a regularization term that penalizes the model for incorrectly hallucinating wake-words in the early phases of training on speech recognition data from digital voice assistants.
Advancing Sequence-to-Sequence Based Speech Recognition
This work reports the lowest sequence-to-sequence model based numbers on this task to date; the single system even challenges the best result known in the literature, namely a hybrid model with recurrent language model rescoring.

References

Showing 1-10 of 35 references
Flat-Start Single-Stage Discriminatively Trained HMM-Based Models for ASR
This study investigates flat-start one-stage training of neural networks using the lattice-free maximum mutual information (LF-MMI) objective function with HMMs for large vocabulary continuous speech recognition, and proposes a standalone system which achieves word error rates comparable with those of state-of-the-art multi-stage systems while being much faster to prepare.
Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI
A method is described to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.
Direct Acoustics-to-Word Models for English Conversational Speech Recognition
This paper presents the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks, Switchboard and CallHome, presents rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrasts the performance of word and phone CTC models.
End-to-end attention-based large vocabulary speech recognition
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a recurrent neural network (RNN) to learn alignments between sequences of input frames and output labels.
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
This paper presents the Eesen framework, which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs) while at the same time speeding up decoding significantly.
Standalone training of context-dependent deep neural network acoustic models
  • C. Zhang, P. Woodland
  • Computer Science
  • 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
This paper introduces a method for training state-of-the-art CD-DNN-HMMs without relying on such a pre-existing system, achieved in two steps: build a context-independent (CI) DNN iteratively with word transcriptions, then cluster the equivalent output distributions of the untied CD-HMM states using the decision-tree-based state tying approach.
Deep Speech: Scaling up end-to-end speech recognition
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00 benchmark, achieving 16.0% error on the full test set.
Exploring neural transducers for end-to-end speech recognition
It is shown that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model on the popular Hub5'00 benchmark.
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
CTC in the Context of Generalized Full-Sum HMM Training
A generalized hybrid HMM-NN training procedure using the full-sum over the hidden state sequence is formulated, identifying CTC as a special case of it, and an analysis of the alignment behavior of such a training procedure is presented.