Corpus ID: 398770

Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

@inproceedings{Seide2012ConversationalST,
  title={Conversational Speech Transcription Using Context-Dependent Deep Neural Networks},
  author={Frank Seide and Gang Li and Dong Yu},
  booktitle={ICML},
  year={2012}
}
Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network pre-training. CD-DNN-HMMs greatly outperform conventional CD-GMM (Gaussian mixture model) HMMs: The word error rate is reduced by up to one third on the difficult benchmarking task of speaker-independent single-pass transcription of telephone conversations. 
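To make the model structure concrete, here is a minimal numpy sketch of the DNN half of a CD-DNN-HMM: a window of spliced acoustic frames passes through a stack of sigmoid hidden layers and a softmax output over senones (tied context-dependent states). Every size below (39-dimensional frames, a ±5-frame context window, three 512-unit hidden layers, 1000 senones) is an illustrative assumption rather than the configuration reported in the paper, and a real decoder would further divide the posteriors by senone priors before using them as HMM emission scores.

# Sketch of a CD-DNN acoustic model: spliced frames -> deep net -> senone posteriors.
# All sizes are placeholders, not the paper's configuration.
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 39              # e.g. static + delta features (assumed)
CONTEXT = 5                 # frames of acoustic context on each side (assumed)
HIDDEN = [512, 512, 512]    # hidden-layer widths (placeholder; the paper used larger nets)
NUM_SENONES = 1000          # tied context-dependent states (placeholder)

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Build the stack: the input is 2*CONTEXT+1 frames spliced into one vector.
dims = [FRAME_DIM * (2 * CONTEXT + 1)] + HIDDEN + [NUM_SENONES]
layers = [init_layer(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]

def senone_posteriors(frame_window):
    """Forward pass: spliced frames -> sigmoid hidden layers -> softmax over senones."""
    h = frame_window.reshape(-1)
    for i, (W, b) in enumerate(layers):
        z = h @ W + b
        h = softmax(z) if i == len(layers) - 1 else sigmoid(z)
    return h    # P(senone | acoustic window)

# Example: posteriors for one random acoustic window.
window = rng.normal(size=(2 * CONTEXT + 1, FRAME_DIM))
posteriors = senone_posteriors(window)
print(posteriors.shape, posteriors.sum())   # (1000,), sums to 1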
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription
TLDR
This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.
Standalone training of context-dependent deep neural network acoustic models
  • C. Zhang, P. Woodland
  • Computer Science
  • 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
This paper introduces a method for training state-of-the-art CD-DNN-HMMs without relying on a pre-existing GMM-HMM system, and achieves this in two steps: build a context-independent (CI) DNN iteratively from word transcriptions, and cluster the equivalent output distributions of the untied CD-HMM states using the decision-tree-based state-tying approach.
Context-dependent Deep Neural Networks for audio indexing of real-life data
TLDR
It is found that for the best speaker-independent CD-DNN-HMM, with 32k senones trained on 2000h of data, the one-fourth reduction does carry over to inhomogeneous field data, and that DNN likelihood evaluation is a sizeable runtime factor even in the wide-beam context of generating rich lattices.
Improving English Conversational Telephone Speech Recognition
TLDR
This work investigated several techniques to improve acoustic modeling, namely speaker-dependent bottleneck features, deep Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks, data augmentation and score fusion of DNN and BLSTM models.
Pipelined Back-Propagation for Context-Dependent Deep Neural Networks
TLDR
It is shown that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server.
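As a rough illustration of what this approximation changes, the toy simulation below runs ordinary back-propagation on a tiny two-layer network but applies the first layer's weight update a couple of minibatches late, mimicking the delayed updates that arise when layers live on different GPUs and each device works on a different minibatch at any time. It is a single-process stand-in under assumed sizes and learning rate, not the multi-GPU implementation described in the paper.

# Toy simulation of the delayed-update effect of layer-parallel (pipelined) BP.
# Network size, delay, and learning rate are arbitrary choices for illustration.
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

D_IN, D_H, D_OUT = 8, 16, 1
W1 = rng.normal(0, 0.3, (D_IN, D_H)); b1 = np.zeros(D_H)
W2 = rng.normal(0, 0.3, (D_H, D_OUT)); b2 = np.zeros(D_OUT)
LR = 0.05
DELAY = 2            # pretend the first layer sits on another GPU: its update arrives late

pending = deque()    # first-layer gradients waiting to be applied

def forward_backward(X, y):
    """Plain back-propagation on the current weights; returns loss and gradients."""
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)
    g_pred = err / len(X)
    dW2, db2 = h.T @ g_pred, g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (1 - h ** 2)
    dW1, db1 = X.T @ g_h, g_h.sum(0)
    return loss, (dW1, db1, dW2, db2)

for step in range(501):
    X = rng.normal(size=(32, D_IN))
    y = X[:, :1] - X[:, 1:2]                 # simple synthetic target: x0 - x1
    loss, (dW1, db1, dW2, db2) = forward_backward(X, y)

    # The output layer is updated immediately (it sits next to the error signal)...
    W2 -= LR * dW2; b2 -= LR * db2
    # ...while the first layer's update is queued, as in a layer-parallel pipeline.
    pending.append((dW1, db1))
    if len(pending) > DELAY:
        old_dW1, old_db1 = pending.popleft()
        W1 -= LR * old_dW1; b1 -= LR * old_db1

    if step % 100 == 0:
        print(f"step {step:3d}  loss {loss:.4f}")

In this toy run the loss should still decrease despite the stale first-layer updates, which is the kind of robustness the pipelined approximation relies on.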
Context-dependent deep neural networks for commercial Mandarin speech recognition applications
TLDR
It is demonstrated that CD-DNN-HMMs achieve a relative 26% word error reduction and a relative 16% sentence error reduction in Baidu's short message (SMS) voice input and voice search applications, respectively, compared with state-of-the-art CD-GMM-HMMs trained using fMPE.
Pipelined BackPropagation for Context-Dependent Deep Neural Networks
The Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, is a recently proposed acoustic-modeling technique for HMM-based speech recognition that can greatly outperform conventional Gaussian-mixture (GMM) HMMs.
Fast-LSTM acoustic model for distant speech recognition
TLDR
The proposed Fast Long Short-Term Memory (Fast-LSTM) acoustic model combines a time delay neural network (TDNN) with an LSTM network to reduce the training time of the standard LSTM acoustic model.
Context dependent state tying for speech recognition using deep neural network acoustic models
  • M. Bacchiani, David Rybach
  • Computer Science
  • 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
An algorithm is proposed to design a tied-state inventory for a context-dependent, neural-network-based acoustic model for speech recognition; it optimizes state tying directly on the activation vectors of the neural network.
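The paper's actual algorithm is more involved, but the general idea of grouping context-dependent states whose network activations look alike can be sketched with a plain k-means clustering of (hypothetical) per-state mean activation vectors; every name and number below is an illustrative assumption, not part of the proposed method.

# Illustrative only: tie context-dependent states by clustering activation vectors.
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=50):
    """Plain k-means; returns a cluster index for each row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(0)
    return assign

# Pretend each untied context-dependent state has a mean hidden-activation vector
# collected from frames aligned to it (random stand-ins here).
num_untied_states, act_dim, num_tied = 300, 64, 40
mean_activations = rng.normal(size=(num_untied_states, act_dim))

tying = kmeans(mean_activations, num_tied)
print("untied state 0..9 -> tied state", tying[:10])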
Recent Improvements to Neural Network based Acoustic Modeling in the EML Transcription Platform
In recent years, automatic speech recognition has enjoyed tremendous improvements from the use of (deep) neural networks (DNNs) for both acoustic modeling and stochastic language modeling [1, 2].

References

SHOWING 1-10 OF 21 REFERENCES
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
TLDR
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture is proposed that trains the DNN to produce a distribution over senones (tied triphone states) as its output and can significantly outperform conventional context-dependent Gaussian mixture model (GMM) HMMs.
Deep Belief Networks for phone recognition
TLDR
Deep Belief Networks (DBNs) have recently proved to be very effective in a variety of machine learning problems, and this paper applies DBNs to acoustic modeling.
Context-dependent connectionist probability estimation in a hybrid hidden Markov model-neural net speech recognition system
TLDR
A new training procedure that "smooths" networks with different degrees of context dependence is proposed to obtain a robust estimate of the context-dependent probabilities of the HMM/MLP speaker-independent continuous speech recognition system.
ACID/HNN: clustering hierarchies of neural networks for context-dependent connectionist acoustic modeling
  • J. Fritsch, M. Finke
  • Computer Science
  • Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
  • 1998
TLDR
It is argued that a hierarchical approach is crucial in applying locally discriminative connectionist models to the typically very large state spaces observed in LVCSR systems.
Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition
TLDR
It is shown that pre-training can initialize weights to a point in the space where fine-tuning can be effective and thus is crucial in training deep structured models and in the recognition performance of a CD-DBN-HMM based large-vocabulary speech recognizer.
Recent innovations in speech-to-text transcription at SRI-ICSI-UW
TLDR
It is shown that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker, and speech modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques.
Connectionist probability estimators in HMM speech recognition
TLDR
It is shown that a connectionist component improves a state-of-the-art HMM system through a statistical interpretation of connectionist networks as probability estimators.
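The statistical interpretation referred to here is usually written as a Bayes-rule conversion of the network's state posteriors into scaled likelihoods for HMM decoding; the notation below is generic and not taken from the paper.

% x_t: acoustic observation at time t, s: HMM state, P(s): state prior
% estimated from the training alignments.
\[
  p(x_t \mid s) \;=\; \frac{P(s \mid x_t)\, p(x_t)}{P(s)} \;\propto\; \frac{P(s \mid x_t)}{P(s)}
\]
% Since p(x_t) is the same for all competing states, the decoder can score states
% with the network posterior divided by the state prior.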
Vocabulary-independent speech recognition: the Vocind System
Learning representations by back-propagating errors
TLDR
Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
A Fast Learning Algorithm for Deep Belief Nets
TLDR
A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
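A compact sketch of this greedy layer-wise scheme, using binary RBMs trained with one-step contrastive divergence (CD-1), is given below. The layer sizes, learning rate, and random stand-in data are assumptions for illustration; in the acoustic-modeling papers above, the resulting weights initialise a feed-forward network that is then fine-tuned with back-propagation.

# Hedged sketch of greedy layer-wise pre-training with stacked RBMs (CD-1).
# Sizes, learning rate, and data are illustrative, not values from any paper here.
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05, batch=64):
    """One binary-binary RBM trained with single-step contrastive divergence."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for i in range(0, len(data), batch):
            v0 = data[i:i + batch]
            h0 = sigmoid(v0 @ W + b_h)                       # positive phase
            h_sample = (rng.random(h0.shape) < h0).astype(float)
            v1 = sigmoid(h_sample @ W.T + b_v)               # one Gibbs step back down
            h1 = sigmoid(v1 @ W + b_h)
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)      # CD-1 updates
            b_v += lr * (v0 - v1).mean(0)
            b_h += lr * (h0 - h1).mean(0)
    return W, b_h

def pretrain_stack(data, layer_sizes):
    """Train RBMs one layer at a time; each layer's hidden activities feed the next."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(x, n_hidden)
        weights.append((W, b_h))
        x = sigmoid(x @ W + b_h)     # propagate up to provide data for the next layer
    return weights                   # would initialise a deep net before fine-tuning

# Example with random binary "data" standing in for real features.
toy = (rng.random((512, 100)) < 0.3).astype(float)
stack = pretrain_stack(toy, [64, 32])
print([W.shape for W, _ in stack])   # [(100, 64), (64, 32)]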