GMM-Free Flat Start Sequence-Discriminative DNN Training

@inproceedings{Gosztolya2016GMMFreeFS,
  title={GMM-Free Flat Start Sequence-Discriminative DNN Training},
  author={G{\'a}bor Gosztolya and Tam{\'a}s Gr{\'o}sz and L{\'a}szl{\'o} T{\'o}th},
  booktitle={INTERSPEECH},
  year={2016}
}
Recently, attempts have been made to remove Gaussian mixture models (GMM) from the training process of deep neural network-based hidden Markov models (HMM/DNN). For the GMM-free training of an HMM/DNN hybrid we have to solve two problems, namely the initial alignment of the frame-level state labels and the creation of context-dependent states. Although flat-start training via iteratively realigning and retraining the DNN using a frame-level error function is viable, it is quite cumbersome. Here…
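
The flat-start loop the abstract describes as cumbersome is simple to state. Below is a minimal sketch, assuming hypothetical helpers uniform_segmentation, train_dnn and forced_align; it only illustrates the realign-and-retrain cycle, not the authors' implementation.

    # Minimal sketch of GMM-free flat-start training by iterative realignment.
    # uniform_segmentation, train_dnn and forced_align are hypothetical helpers.
    def flat_start_train(utterances, transcripts, n_iters=5):
        # Flat start: divide each utterance evenly among the states of its
        # transcript, so no GMM (or any other prior model) is needed.
        labels = [uniform_segmentation(u, t)
                  for u, t in zip(utterances, transcripts)]
        dnn = None
        for _ in range(n_iters):
            # Retrain the DNN on the current frame-level state labels with a
            # frame-level error function (e.g. cross-entropy).
            dnn = train_dnn(utterances, labels)
            # Realign: forced alignment with the new DNN's state posteriors
            # gives improved frame-level labels for the next iteration.
            labels = [forced_align(dnn, u, t)
                      for u, t in zip(utterances, transcripts)]
        return dnn, labels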

Citations

A Comparative Evaluation of GMM-Free State Tying Methods for ASR

TLDR
Four refinements adapting the state tying algorithm to the HMM/DNN hybrid architecture were tested; methods that utilized a decision criterion designed directly for neural networks consistently and significantly outperformed those that employed the standard Gaussian-based algorithm.

Inverted HMM - a Proof of Concept

TLDR
This work proposes an inverted hidden Markov model (HMM) approach to automatic speech and handwriting recognition that naturally incorporates discriminative, artificial neural network based label distributions and inversely aligns each element of an HMM state label sequence to a single input frame.

Inverted Alignments for End-to-End Automatic Speech Recognition

TLDR
An inverted alignment approach for sequence classification systems like automatic speech recognition (ASR) that naturally incorporates discriminative, artificial-neural-network-based label distributions and allows for a variety of model assumptions, including statistical variants of attention.

Calibrating DNN Posterior Probability Estimates of HMM/DNN Models to Improve Social Signal Detection from Audio Data

TLDR
This study shows experimentally that by employing a simple posterior probability calibration technique on the DNN outputs, the performance of the HMM/DNN workflow can be significantly improved.
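
For context, a hybrid HMM/DNN decoder conventionally turns the DNN's state posteriors into scaled likelihoods by dividing by the state priors; a calibration step can then reshape these scores. A minimal sketch, with an assumed exponent gamma as the calibration knob (the exact technique in the cited paper may differ):

    import numpy as np

    # Sketch: DNN posteriors -> log scaled likelihoods for HMM decoding,
    # with an assumed exponent gamma acting as a simple calibration knob.
    def calibrated_log_likelihoods(posteriors, priors, gamma=1.0, eps=1e-10):
        # posteriors: (n_frames, n_states) softmax outputs of the DNN
        # priors:     (n_states,) state priors from the training alignment
        log_scaled = np.log(posteriors + eps) - np.log(priors + eps)
        return gamma * log_scaled  # scores passed to the Viterbi decoder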

Training Methods for Deep Neural Network-Based Acoustic Models in Speech Recognition

TLDR
New algorithms are proposed for obtaining the initial alignment of the frame-level state labels and for creating the context-dependent states, and these are found to be better suited to the new acoustic models.

The Functional Expansion of Automatic Speech Recognition Output Based on Novel Prosody- and Text-Based Approaches

TLDR
This thesis shows that not only manually written texts but also automatically punctuated texts are more readable and understandable for humans than the raw word chain without punctuation.

Joint training methods for tandem and hybrid speech recognition systems using deep neural networks

References

Building context-dependent DNN acoustic models using Kullback-Leibler divergence-based state tying

TLDR
This paper investigates an alternative state clustering method that uses the Kullback-Leibler (KL) divergence of DNN output vectors to build the decision tree, and shows that it is also beneficial for HMM/DNN hybrids.
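
To make the idea concrete, here is a sketch of how a KL-based split score for decision-tree state tying can be computed from DNN outputs; the names are illustrative and not taken from the paper.

    import numpy as np

    # Sketch of a KL-divergence split score for DNN-based state tying:
    # each branch is summarized by the mean DNN output distribution of its
    # frames, and a tree question is scored by the KL gain of the split.
    def kl_div(p, q, eps=1e-10):
        return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

    def split_score(frames_yes, frames_no):
        # frames_*: (n_frames, n_outputs) DNN posteriors routed to a branch
        parent = np.concatenate([frames_yes, frames_no]).mean(axis=0)
        yes, no = frames_yes.mean(axis=0), frames_no.mean(axis=0)
        # Frame-weighted divergence of the children from their parent;
        # tree building greedily picks the question with the highest score.
        return (len(frames_yes) * kl_div(yes, parent)
                + len(frames_no) * kl_div(no, parent))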

Standalone training of context-dependent deep neural network acoustic models

  • Chao Zhang, P. Woodland
  • Computer Science
  • 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
This paper introduces a method for training state-of-the-art CD-DNN-HMMs without relying on a pre-existing GMM-HMM system, achieving this in two steps: building a context-independent (CI) DNN iteratively from word transcriptions, and clustering the equivalent output distributions of the untied CD-HMM states using the decision-tree-based state tying approach.

A Sequence Training Method for Deep Rectifier Neural Networks in Speech Recognition

TLDR
Though Connectionist Temporal Classification was originally proposed for the training of recurrent neural networks, here it is shown that it can also be used to train more conventional feed-forward networks.
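
As an illustration of that claim, the PyTorch sketch below applies CTC loss to a plain feed-forward network; CTC only requires per-frame log-posteriors, so no recurrence is needed. The sizes and two-layer topology are assumptions.

    import torch
    import torch.nn as nn

    # Sketch: CTC loss on a feed-forward network (sizes are illustrative).
    T, batch, n_feats, n_states, tgt_len = 300, 1, 440, 1000, 50
    ffnn = nn.Sequential(nn.Linear(n_feats, 2048), nn.ReLU(),
                         nn.Linear(2048, n_states + 1))  # +1 for the CTC blank

    x = torch.randn(T, batch, n_feats)        # (time, batch, features)
    log_probs = ffnn(x).log_softmax(-1)       # (T, N, C), as CTCLoss expects
    targets = torch.randint(1, n_states + 1, (batch, tgt_len))
    loss = nn.CTCLoss(blank=0)(log_probs, targets,
                               torch.full((batch,), T),
                               torch.full((batch,), tgt_len))
    loss.backward()  # gradients reach the feed-forward weights as usual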

Sequence-discriminative training of deep neural networks

TLDR
Different sequence-discriminative criteria are shown to lower word error rates by 7-9% relative, on a standard 300 hour American conversational telephone speech task.

GMM-free DNN acoustic model training

TLDR
It is shown that CD trees can be built with DNN alignments which are better matched to the DNN model and its features and that these trees and alignments result in better models than from the GMM alignments and trees.

Asynchronous, online, GMM-free training of a context dependent acoustic model for speech recognition

We propose an algorithm that allows online training of a context dependent DNN model. It designs a state inventory based on DNN features and jointly optimizes the DNN parameters and alignment of the…

Decision tree clustering for KL-HMM

TLDR
An algorithm is presented that allows for decision tree clustering for KL-HMM based ASR systems and permits the synthesis of contexts that were unseen in the training data.

Chapter 8: Deep Neural Network Sequence-Discriminative Training

TLDR
This chapter introduces sequence-discriminative training techniques that are better matched to the sequential nature of the speech recognition problem, and describes the popular maximum mutual information (MMI), boosted MMI, minimum phone error (MPE), and minimum Bayes risk (MBR) training criteria.
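
For reference, the MMI and boosted MMI objectives covered in the chapter have the following standard form (X_u and W_u are the acoustics and reference transcription of utterance u, and the competitor sum is in practice restricted to a denominator lattice):

    % kappa is the usual acoustic scaling factor.
    F_{\mathrm{MMI}}(\theta) = \sum_u \log
        \frac{p_\theta(X_u \mid S_{W_u})^{\kappa}\, P(W_u)}
             {\sum_W p_\theta(X_u \mid S_W)^{\kappa}\, P(W)}

    % Boosted MMI down-weights competitors in proportion to their raw
    % accuracy A(W, W_u), with boosting factor b >= 0:
    F_{\mathrm{bMMI}}(\theta) = \sum_u \log
        \frac{p_\theta(X_u \mid S_{W_u})^{\kappa}\, P(W_u)}
             {\sum_W p_\theta(X_u \mid S_W)^{\kappa}\, P(W)\, e^{-b\, A(W, W_u)}}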

Error back propagation for sequence training of Context-Dependent Deep NetworkS for conversational speech transcription

TLDR
This work investigates back-propagation based sequence training of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, for conversational speech transcription, and finds that to get reasonable results, heuristics are needed, which points to a problem with lattice sparseness.