• Corpus ID: 4493985

End-to-end neural segmental models for speech recognition

Authors: Haozhan Tang, Liang Lu, Kevin Gimpel, Chris Dyer, A. Richard Smith
Segmental models are an alternative to frame-based models for sequence prediction, where hypothesized path weights are based on entire segment scores rather than a single frame at a time. Neural segmental models are segmental models that use neural network-based weight functions. Neural segmental models have achieved competitive results for speech recognition, and their end-to-end training has been explored in several studies. In this work, we review neural segmental models, which can be viewed… 
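As a rough illustration of the idea that a path is scored by whole segments rather than by individual frames, the following minimal sketch runs a segmental Viterbi search over all segmentations up to a maximum segment length. It is not the paper's implementation: the names `segmental_viterbi` and `toy_weight`, the per-segment penalty, and the toy frame scores are all assumptions made for illustration, and the neural weight function is replaced here by a simple sum of per-frame scores.

```python
from typing import Callable, List, Sequence, Tuple

# A segment is (start frame, end frame exclusive, label).
Segment = Tuple[int, int, str]


def segmental_viterbi(
    num_frames: int,
    labels: Sequence[str],
    max_len: int,
    weight: Callable[[int, int, str], float],
) -> Tuple[float, List[Segment]]:
    """Return the highest-scoring segmentation and its total weight.

    `weight(start, end, label)` scores an entire segment at once; in a
    neural segmental model it would be computed by a neural network.
    """
    if max_len < 1 or not labels:
        raise ValueError("need max_len >= 1 and at least one label")

    neg_inf = float("-inf")
    best = [neg_inf] * (num_frames + 1)  # best[t]: best score of a path ending at frame t
    back: List[Segment] = [(0, 0, "")] * (num_frames + 1)
    best[0] = 0.0

    for end in range(1, num_frames + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                score = best[start] + weight(start, end, label)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, end, label)

    # Follow back-pointers from the last frame to recover the segments.
    path: List[Segment] = []
    t = num_frames
    while t > 0:
        path.append(back[t])
        t = back[t][0]
    path.reverse()
    return best[num_frames], path


# Hypothetical toy scores: one dict of label scores per frame.
FRAME_SCORES = [{"a": 1.0, "b": 0.0}, {"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}]
SEGMENT_PENALTY = 0.5  # assumed fixed cost per segment, discourages over-segmentation


def toy_weight(start: int, end: int, label: str) -> float:
    """Stand-in for a neural weight function: summed frame scores minus a penalty."""
    return sum(FRAME_SCORES[t][label] for t in range(start, end)) - SEGMENT_PENALTY


if __name__ == "__main__":
    score, segments = segmental_viterbi(len(FRAME_SCORES), ["a", "b"], 3, toy_weight)
    print(score, segments)
```

Because the weight function sees a whole segment, it can use features such as duration or a summary of all frames in the span, which a purely frame-level score cannot express. The search cost grows with the number of frames, the maximum segment length, and the number of labels.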


Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
A segment-based unsupervised clustering algorithm is proposed to re-assign class labels to the segments of speech utterances; it improves the TD-SV performance of TCL-BN and ASR-derived BN features with respect to their standalone counterparts.


Deep segmental neural networks for speech recognition
The deep segmental neural network (DSNN) is proposed, a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments with variable lengths, which allows the DSNN to represent each segment as a single unit, in which frames are made dependent on each other.
Segmental conditional random fields with deep neural networks as acoustic models for first-pass word recognition
This paper proposes to use SCRFs with DNNs directly as the acoustic model, a one-pass unified framework that can utilize local phone classifiers, phone transitions and long-span features, in direct word decoding to model phones or sub-phonetic segments with variable length.
Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
It is shown that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode.
A comparison of training approaches for discriminative segmental models
This paper investigates various losses, introduces a new cost function for training segmental models, compares lattice rescoring results for multiple tasks, and studies the impact of several choices required when optimizing these losses.
Simplifying long short-term memory acoustic models for fast training and decoding
To accelerate decoding of LSTMs, it is proposed to apply frame skipping during training, and frame skipping and posterior copying (FSPC) during decoding to resolve two challenges faced by LSTM models: high model complexity and poor decoding efficiency.
Multitask Learning with CTC and Segmental CRF for Speech Recognition
It is found that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models, and it is shown that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
A segmental CRF approach to large vocabulary continuous speech recognition
  • G. Zweig, P. Nguyen
  • Computer Science
    2009 IEEE Workshop on Automatic Speech Recognition & Understanding
  • 2009
A segmental conditional random field framework for large vocabulary continuous speech recognition that allows for the joint or separate discriminative training of the acoustic and language models.
Attention-Based Models for Speech Recognition
The attention mechanism is extended with features needed for speech recognition, and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
Classification and recognition with direct segment models
  • G. Zweig
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
Initial steps are taken at using segment based direct models on their own, first by developing a segment-based maximum entropy phone classifier, and then by utilizing the features in a segmental conditional random field for recognition.
Multiframe deep neural networks for acoustic modeling
This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations.