Corpus ID: 73605089

Sequence Modeling and Alignment for LVCSR-Systems

Eugen Beck, Albert Zeyer, Patrick Doetsch, André Merboldt, Ralf Schlüter, Hermann Ney
ITG Symposium on Speech Communication
Today, modeling automatic speech recognition (ASR) systems using deep neural networks (DNNs) has led to considerable improvements in performance, with word error rates roughly halved compared to systems from 10 to 15 years ago. Current state-of-the-art systems, at least when trained on moderate amounts of training data, still follow the classical separation into language models and generative acoustic models. Acoustic modeling in these systems follows the…

Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept

It is proved that the widely used class of RNN-transducer models and segmental models (direct HMMs) are equivalent and therefore have equal modeling power; blank probabilities translate into segment-length probabilities and vice versa.
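The stated correspondence between blank and segment-length probabilities can be illustrated with a toy calculation. Under one simple reading of the equivalence (our own simplification, not the paper's exact construction): if a transducer emits blank with probability b_t at frame t, then a segment of length k corresponds to k−1 blanks followed by a non-blank label.

```python
def blank_to_segment_length_probs(blank_probs):
    """Toy conversion of per-frame blank probabilities into a
    segment-length distribution: P(len = k) = (1 - b_k) * prod_{t<k} b_t.
    Function name and framing are ours, for illustration only."""
    probs = []
    survive = 1.0  # probability of having emitted only blanks so far
    for b in blank_probs:
        probs.append(survive * (1.0 - b))  # label emitted at this frame
        survive *= b                       # blank emitted, segment continues
    return probs

# Uniform blank probability 0.5 over 3 frames:
print(blank_to_segment_length_probs([0.5, 0.5, 0.5]))  # [0.5, 0.25, 0.125]
```

The residual mass (here 0.125) is the probability that the segment extends beyond the frames considered, which is why a proper segment-length model must also bound or renormalize over the maximum segment length.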

A comprehensive analysis on attention models

This work investigates pretraining variants such as growing in depth and width and their impact on final performance, leading to over 8% relative improvement in word error rate.

An Analysis of Local Monotonic Attention Variants

A simple technique implements windowed attention on top of an existing global attention model; the proposed model can be trained from random initialization and achieves results comparable to the global attention baseline.
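The "window on top of an existing global attention model" idea can be sketched as masking: the global scorer produces energies over all frames, and positions outside a window around the expected source position are excluded before the softmax. This is a minimal pure-Python sketch; the function and argument names are ours, not the paper's.

```python
import math

def windowed_attention(energies, center, width):
    """Mask global attention energies to a hard window
    [center - width, center + width], then softmax over the window.
    Frames outside the window receive weight 0."""
    in_window = [i for i in range(len(energies)) if abs(i - center) <= width]
    m = max(energies[i] for i in in_window)  # subtract max for stability
    exp = {i: math.exp(energies[i] - m) for i in in_window}
    z = sum(exp.values())
    return [exp.get(i, 0.0) / z for i in range(len(energies))]

att = windowed_attention([0.0] * 10, center=4, width=1)
# uniform energies: frames 3, 4, 5 share the mass equally, all others get 0
```

Because the masking happens after the global scorer, an existing global attention model can be reused unchanged, which matches the summary's claim that the technique applies on top of a trained or randomly initialized global model.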

A study of latent monotonic attention variants

This paper presents a mathematically clean way to introduce monotonicity via a new latent variable representing the audio position or segment boundaries, and compares several monotonic latent models to the authors' global soft attention baseline.

A New Training Pipeline for an Improved Neural Transducer

It is found that the transducer model generalizes much better on longer sequences than the attention model and outperforms it on Switchboard 300h by over 6% relative in WER.

Inverted Alignments for End-to-End Automatic Speech Recognition

An inverted alignment approach for sequence classification systems like automatic speech recognition (ASR) that naturally incorporates discriminative, artificial-neural-network-based label distributions and allows for a variety of model assumptions, including statistical variants of attention.

Segmental conditional random fields with deep neural networks as acoustic models for first-pass word recognition

This paper proposes to use segmental conditional random fields (SCRFs) with DNNs directly as the acoustic model: a one-pass unified framework that models phones or sub-phonetic segments of variable length and can utilize local phone classifiers, phone transitions, and long-span features in direct word decoding.

Deep segmental neural networks for speech recognition

The deep segmental neural network (DSNN) is proposed: a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments of variable length, representing each segment as a single unit whose frames depend on each other.

State-of-the-Art Speech Recognition with Sequence-to-Sequence Models

A variety of structural and optimization improvements to the Listen, Attend and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced that offers improvements over the commonly used single-head attention.
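The multi-head idea can be shown in miniature: the feature dimension is split into per-head slices, each head attends independently, and the results are concatenated. This is a deliberately stripped-down sketch with no learned projections (a real implementation projects queries, keys, and values with separate learned matrices per head); all names here are ours.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in keys]
    w = softmax(scores)
    return [sum(wi * v[d] for wi, v in zip(w, values))
            for d in range(len(values[0]))]

def multi_head_attend(q, keys, values, n_heads):
    """Split the feature dimension into n_heads slices, attend per slice,
    and concatenate the per-head outputs (toy version, no projections)."""
    d = len(q) // n_heads
    out = []
    for h in range(n_heads):
        s = slice(h * d, (h + 1) * d)
        out += attend(q[s], [k[s] for k in keys], [v[s] for v in values])
    return out
```

The gain over single-head attention in the cited work comes from each head being able to focus on a different part of the input; in this sketch the heads only differ through their feature slices, whereas learned projections make the specialization much more flexible.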

Segmental Recurrent Neural Networks for End-to-End Speech Recognition

Practical training and decoding issues, as well as methods to speed up training for speech recognition, are discussed; the model is self-contained and can be trained end-to-end.

On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition

This paper presents a more effective stochastic gradient descent (SGD) learning-rate schedule that can significantly improve recognition accuracy, and demonstrates that using multiple recurrent layers in the encoder can reduce the word error rate.

Improved training of end-to-end attention models for speech recognition

This work introduces a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.
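The pretraining scheme described here amounts to a schedule over the encoder's time reduction factor: start with a coarsely downsampled time axis and refine it as training progresses. The sketch below is our own illustration with made-up factors and epochs, not the paper's exact values; `time_reduce` stands in for the encoder's time pooling, which in practice operates on learned features, not raw frames.

```python
def time_reduce(frames, factor):
    """Downsample a frame sequence by keeping every `factor`-th frame
    (stand-in for the encoder's time pooling)."""
    return frames[::factor]

# Hypothetical schedule: epoch -> time reduction factor (values are ours).
schedule = {0: 32, 5: 16, 10: 8}

def factor_for_epoch(epoch, schedule):
    """Return the reduction factor active at `epoch`: the entry with the
    largest starting epoch not exceeding `epoch`."""
    return schedule[max(e for e in schedule if e <= epoch)]

print(factor_for_epoch(7, schedule))  # 16: the factor set at epoch 5
```

The motivation for starting coarse is that attention over a short, heavily downsampled encoder output is far easier to learn from scratch; once the alignment is roughly in place, lowering the factor restores temporal resolution for final accuracy.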

A segmental CRF approach to large vocabulary continuous speech recognition

  • G. Zweig, P. Nguyen
  • 2009 IEEE Workshop on Automatic Speech Recognition & Understanding
  • 2009
A segmental conditional random field framework for large vocabulary continuous speech recognition that allows for the joint or separate discriminative training of the acoustic and language models.

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI

A method is described to perform sequence-discriminative training of neural-network acoustic models without the need for frame-level cross-entropy pre-training, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.

RASR/NN: The RWTH neural network toolkit for speech recognition

The results show that RASR achieves state-of-the-art performance on a real-world large-vocabulary task, while offering a complete pipeline for building and applying large-scale speech recognition systems.