Segmental Encoder-Decoder Models for Large Vocabulary Automatic Speech Recognition

Eugen Beck, Mirko Hannemann, Patrick Doetsch, Ralf Schlüter, Hermann Ney
It has long been known that the classic Hidden Markov Model (HMM) derivation for speech recognition contains assumptions, such as independence of observation vectors and weak duration modeling, that are practical but unrealistic. When using the hybrid approach, this is amplified by trying to fit a discriminative model into a generative one. Hidden Conditional Random Fields (CRFs) and segmental models (e.g. Semi-Markov CRFs / Segmental CRFs) have been proposed as an alternative, but for a…


Sequence Modeling and Alignment for LVCSR-Systems
Two novel approaches to DNN-based ASR are discussed and analyzed: the attention-based encoder-decoder approach and the (segmental) inverted HMM approach, with specific focus on the sequence alignment behavior of the different approaches.
Exploring A Zero-Order Direct HMM Based on Latent Attention for Automatic Speech Recognition
A simple yet elegant latent variable attention model for automatic speech recognition (ASR) is proposed, which enables an integration of attention sequence modeling into the direct hidden Markov model (HMM) concept, together with a qualitative analysis of the alignment behavior of the different approaches.
Segment-level Training of ANNs Based on Acoustic Confidence Measures for Hybrid HMM/ANN Speech Recognition
We show that confidence measures estimated from local posterior probabilities can serve as objective functions for training ANNs in hybrid HMM based speech recognition systems. This leads to a
Learning to Count Words in Fluent Speech Enables Online Speech Recognition
This work introduces Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting that uses the cumulative word sum to dynamically segment speech and enable its eager decoding into words.
Segment boundary detection directed attention for online end-to-end speech recognition
A new attention mechanism for learning online alignment is proposed, which decomposes the conventional alignment into two parts: segmentation (segment boundary detection with hard decision) and segment-directed attention (information aggregation within the segment with soft attention).
Netze in der automatischen Spracherkennung - ein Paradigmenwechsel? (Neural Networks in Automatic Speech Recognition - a Paradigm Change?)
In automatic speech recognition, as in machine learning in general, the structures of the associated stochastic modeling are today more and more […] on different forms
Computational intelligence in processing of speech acoustics: a survey
This paper presents a comprehensive survey of speech recognition techniques for non-Indian and Indian languages, and compiles some of the computational models used for processing speech acoustics.
Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept
It is proved that the widely used class of RNN-Transducer models and segmental models (direct HMM) are equivalent and therefore show equal modeling power, and it is shown that blank probabilities translate into segment length probabilities and vice versa.
A study of latent monotonic attention variants
This paper presents a mathematically clean solution to introduce monotonicity, by introducing a new latent variable which represents the audio position or segment boundaries, and compares several monotonic latent models to the authors' global soft attention baseline.
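The entry above on the equivalence of segmental and neural transducer modeling states that blank probabilities translate into segment length probabilities and vice versa. The following is a minimal sketch of that idea under strong simplifying assumptions that are not taken from the cited paper: blank probabilities depend only on the position inside the current segment, a segment ends at the first non-blank frame, and the maximum segment length is fixed. The function names `blank_to_length_probs` and `length_to_blank_probs` are hypothetical.

```python
from typing import List


def blank_to_length_probs(blank: List[float]) -> List[float]:
    """Convert per-position blank probabilities into a length distribution.

    blank[k] is the assumed probability of emitting blank at position k+1
    of a segment. With L-1 blank probabilities, lengths 1..L are possible;
    the leftover probability mass is assigned to the maximum length L.
    """
    probs = []
    survive = 1.0  # probability that the segment has not ended yet
    for b in blank:
        probs.append(survive * (1.0 - b))  # segment ends at this position
        survive *= b                       # segment continues past it
    probs.append(survive)                  # forced end at the maximum length
    return probs


def length_to_blank_probs(length_probs: List[float]) -> List[float]:
    """Invert the mapping: recover blank probabilities from length probabilities."""
    blank = []
    survive = 1.0
    for p in length_probs[:-1]:
        # P(continue | segment reached this position)
        blank.append((survive - p) / survive if survive > 0.0 else 0.0)
        survive -= p
    return blank


lengths = blank_to_length_probs([0.5, 0.25])
recovered = length_to_blank_probs(lengths)
```

In this toy setting the two functions are inverses of each other, which mirrors the "and vice versa" part of the summary; the actual proof in the cited work covers context-dependent models and is not reproduced here.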


Segmental conditional random fields with deep neural networks as acoustic models for first-pass word recognition
This paper proposes to use SCRFs with DNNs directly as the acoustic model, a one-pass unified framework that can utilize local phone classifiers, phone transitions and long-span features, in direct word decoding to model phones or sub-phonetic segments with variable length.
A segmental CRF approach to large vocabulary continuous speech recognition
  • G. Zweig, P. Nguyen
  • Computer Science
    2009 IEEE Workshop on Automatic Speech Recognition & Understanding
  • 2009
A segmental conditional random field framework for large vocabulary continuous speech recognition that allows for the joint or separate discriminative training of the acoustic and language models.
Attention-Based Models for Speech Recognition
The attention mechanism is extended with features needed for speech recognition and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
Deep segmental neural networks for speech recognition
The deep segmental neural network (DSNN) is proposed, a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments with variable lengths, which allows the DSNN to represent each segment as a single unit, in which frames are made dependent on each other.
Inverted Alignments for End-to-End Automatic Speech Recognition
An inverted alignment approach for sequence classification systems like automatic speech recognition (ASR) that naturally incorporates discriminative, artificial-neural-network-based label distributions and allows for a variety of model assumptions, including statistical variants of attention.
Multitask Learning with CTC and Segmental CRF for Speech Recognition
It is found that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models, and it is shown that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
From HMM's to segment models: a unified view of stochastic modeling for speech recognition
A general stochastic model is described that encompasses most of the models proposed in the literature for speech recognition, pointing out similarities in terms of correlation and parameter tying assumptions, and drawing analogies between segment models and HMMs.
Inverted HMM - a Proof of Concept
This work proposes an inverted hidden Markov model (HMM) approach to automatic speech and handwriting recognition that naturally incorporates discriminative, artificial neural network based label distributions and inversely aligns each element of an HMM state label sequence to a single input frame.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, M. Bacchiani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly-used single-head attention.
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results
Initial results demonstrate that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
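Several of the entries above (segmental CRFs, deep segmental neural networks) describe models that score whole variable-length segments rather than single frames. As a rough illustration of how decoding over segments can work, here is a minimal sketch of a zero-order segmental (semi-Markov) Viterbi search. It is not the method of any listed paper: the function `segmental_viterbi`, the `score` callback, and the toy scoring rule are all hypothetical, and label transition or language model scores are deliberately left out.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start frame, end frame exclusive, label)


def segmental_viterbi(
    num_frames: int,
    labels: Sequence[str],
    max_len: int,
    score: Callable[[int, int, str], float],
) -> Tuple[float, List[Segment]]:
    """Find the best-scoring segmentation of frames 0..num_frames.

    score(start, end, label) is an assumed segment-level score; the total
    score of a segmentation is the sum of its segment scores.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (num_frames + 1)  # best[t]: best score covering frames [0, t)
    back = [None] * (num_frames + 1)     # back[t]: (start, label) of the last segment
    best[0] = 0.0
    for end in range(1, num_frames + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            for label in labels:
                cand = best[start] + score(start, end, label)
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, label)
    # Follow back-pointers from the last frame to recover the segments.
    segments: List[Segment] = []
    t = num_frames
    while t > 0:
        start, label = back[t]
        segments.append((start, t, label))
        t = start
    segments.reverse()
    return best[num_frames], segments


# Toy example: one point per frame matching the label, minus a fixed
# per-segment penalty of 0.5 (purely illustrative numbers).
frames = ["a", "a", "b"]


def toy_score(start: int, end: int, label: str) -> float:
    return sum(1.0 for f in frames[start:end] if f == label) - 0.5


total, segs = segmental_viterbi(len(frames), ["a", "b"], max_len=3, score=toy_score)
```

The search costs on the order of `num_frames * max_len * len(labels)` score evaluations, which is why practical segmental systems usually bound the segment length.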