Inverted Alignments for End-to-End Automatic Speech Recognition

@article{Doetsch2017InvertedAF,
  title={Inverted Alignments for End-to-End Automatic Speech Recognition},
  author={Patrick Doetsch and Mirko Hannemann and Ralf Schl{\"u}ter and Hermann Ney},
  journal={IEEE Journal of Selected Topics in Signal Processing},
  year={2017},
  volume={11},
  pages={1265-1273}
}
In this paper, we propose an inverted alignment approach for sequence classification systems like automatic speech recognition (ASR) that naturally incorporates discriminative, artificial-neural-network-based label distributions. Instead of aligning each input frame to a state label as in the standard hidden Markov model (HMM) derivation, we propose to inversely align each element of an HMM state label sequence to a segment-wise encoding of several consecutive input frames. This enables an… 
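
The inverted alignment described in the abstract can be illustrated with a short dynamic-programming sketch. The Python below is only a toy rendering of that idea, not the paper's implementation: segment_score is a hypothetical stand-in for the neural segment-wise label score, and all lengths and constants are made up.

```python
import math

def segment_score(frames, start, end, label):
    # Hypothetical stand-in: in the paper this score would come from a neural network
    # that encodes frames[start:end] and scores it against `label`; here it is a dummy
    # that merely prefers segments of about three frames.
    return -abs((end - start) - 3) - 0.1 * label

def inverted_align(num_frames, labels, max_seg_len=10):
    """Align each label to one contiguous frame segment; return boundaries 0=t_0<...<t_N=num_frames."""
    N = len(labels)
    NEG = -math.inf
    # best[n][t]: best score for aligning the first n labels to the first t frames
    best = [[NEG] * (num_frames + 1) for _ in range(N + 1)]
    back = [[0] * (num_frames + 1) for _ in range(N + 1)]
    best[0][0] = 0.0
    for n in range(1, N + 1):
        for t in range(n, num_frames + 1):
            for prev in range(max(n - 1, t - max_seg_len), t):
                if best[n - 1][prev] == NEG:
                    continue
                score = best[n - 1][prev] + segment_score(None, prev, t, labels[n - 1])
                if score > best[n][t]:
                    best[n][t], back[n][t] = score, prev
    # trace back the segment boundaries
    bounds, t = [num_frames], num_frames
    for n in range(N, 0, -1):
        t = back[n][t]
        bounds.append(t)
    return list(reversed(bounds))

print(inverted_align(num_frames=20, labels=[0, 1, 2, 3, 4, 5]))
```

In the paper the same kind of search is carried out with neural segment encodings inside the HMM framework; the point of this sketch is only the label-to-segment direction of the alignment.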

Citations

Sequence Modeling and Alignment for LVCSR-Systems

TLDR
Two novel approaches to DNN-based ASR are discussed and analyzed: the attention-based encoder–decoder approach and the (segmental) inverted HMM approach, with specific focus on the sequence alignment behavior of the different approaches.

Exploring A Zero-Order Direct Hmm Based on Latent Attention for Automatic Speech Recognition

TLDR
A simple yet elegant latent variable attention model for automatic speech recognition (ASR) is proposed which enables an integration of attention sequence modeling into the direct hidden Markov model (HMM) concept; the alignment behavior of the different approaches is analyzed qualitatively.

Segmental Encoder-Decoder Models for Large Vocabulary Automatic Speech Recognition

TLDR
Different length modeling approaches for segmental models, their relation to attention-based systems and the first reported results on the Switchboard 300h speech recognition corpus using this approach are explored.

Improved training of end-to-end attention models for speech recognition

TLDR
This work introduces a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.
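
The pretraining scheme summarized above (start with a high time-reduction factor, then lower it during training) can be sketched as a simple staged loop. The factors, epoch counts, and the frame-skipping downsampler below are illustrative assumptions, not the authors' configuration.

```python
def downsample(frames, factor):
    """Reduce time resolution by keeping every factor-th frame (one simple choice)."""
    return frames[::factor]

def train_stage(frames, reduction_factor, epochs):
    reduced = downsample(frames, reduction_factor)
    print(f"stage: factor={reduction_factor:2d}, {len(reduced):3d} encoder steps, {epochs} epochs")
    # ... the attention model would be trained on the reduced sequences here ...

frames = list(range(320))                  # pretend utterance of 320 feature frames
schedule = [(32, 2), (16, 2), (8, 10)]     # (time-reduction factor, epochs): illustrative values
for factor, epochs in schedule:
    train_stage(frames, factor, epochs)
```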

A comprehensive analysis on attention models

TLDR
This work investigates pretraining variants such as growing in depth and width and their impact on the final performance, leading to over 8% relative improvement in word error rate.

A semantic parsing pipeline for context-dependent question answering over temporally structured data

We propose a new setting for question answering (QA) in which users can query the system using both natural language and direct interactions within a graphical user interface that displays multiple…

Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review

TLDR
A systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing, presenting the available models for a wide variety of applications.

An LSTM‐based cell association scheme for proactive bandwidth management in 5G fog radio access networks

TLDR
A novel mobility-aware cell association scheme (MACA) that exploits a user's mobility and downlink rate demand information to associate it with the cell offering the maximum rate; results show that the proposed scheme performs significantly better than the other schemes in terms of average next-cell prediction accuracy.

References

SHOWING 1-10 OF 55 REFERENCES

Inverted HMM - a Proof of Concept

TLDR
This work proposes an inverted hidden Markov model (HMM) approach to automatic speech and handwriting recognition that naturally incorporates discriminative, artificial neural network based label distributions and inversely aligns each element of an HMM state label sequence to a single input frame.
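
A schematic way to see the contrast in this TLDR is the pair of sums below; the notation is generic and hedged, not the paper's exact formulation: the standard HMM sums over frame-to-state alignments, while the inverted model sums over monotone label-to-frame positions t_n.

```latex
% Schematic contrast only (hedged; not the paper's exact notation or equations):
% the standard HMM sums over frame-to-state alignments t -> s_t, whereas the
% inverted model sums over monotone state-to-frame alignments n -> t_n.
\begin{align*}
  \text{standard HMM:} \quad p(x_1^T \mid w)
    &= \sum_{s_1^T} \prod_{t=1}^{T} p(x_t \mid s_t)\, p(s_t \mid s_{t-1}, w) \\
  \text{inverted HMM:} \quad p(s_1^N \mid x_1^T)
    &= \sum_{t_1^N} \prod_{n=1}^{N} p(s_n, t_n \mid t_{n-1}, x_1^T)
\end{align*}
```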

End-to-end attention-based large vocabulary speech recognition

TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.

Joint CTC-attention based end-to-end speech recognition using multi-task learning

TLDR
A novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
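
The multi-task objective referred to above is commonly an interpolation of the CTC and attention losses. The snippet below is a minimal sketch under that assumption; joint_loss, the placeholder loss values, and the weight lam are illustrative, not the paper's code.

```python
def joint_loss(ctc_loss, attention_loss, lam=0.3):
    """Interpolated multi-task objective; lam (the CTC weight) is a tunable hyperparameter."""
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# toy values standing in for per-batch losses computed by the two branches of the model
print(joint_loss(ctc_loss=42.0, attention_loss=17.5, lam=0.3))
```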

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.

Learning acoustic frame labeling for speech recognition with recurrent neural networks

  • H. Sak, A. Senior, J. Schalkwyk
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
TLDR
It is shown that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with CE using HMM state alignments; the effect of sequence-discriminative training on these models is also shown.

Discriminative segmental cascades for feature-rich phone recognition

TLDR
It is shown that beam search is not suitable for learning rescoring models in this approach, though it gives good approximate decoding performance once the model is already well trained; instead, an approach inspired by structured prediction cascades, which uses max-marginal pruning to generate lattices, is considered.

Deep segmental neural networks for speech recognition

TLDR
The deep segmental neural network (DSNN) is proposed, a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments of variable length, allowing the DSNN to represent each segment as a single unit within which frames depend on each other.
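
One way to picture the segment-as-a-single-unit idea is to pool a variable-length span of frames into one fixed-size vector and score it against all phone labels. The numpy sketch below does exactly that with mean pooling and a random linear scorer; both choices are illustrative assumptions, not the DSNN architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
num_phones, feat_dim = 40, 13
W = rng.normal(size=(num_phones, feat_dim))   # stand-in for a trained segment scorer

def segment_scores(frames, start, end):
    """Score the variable-length segment frames[start:end] against all phone labels."""
    seg_embedding = frames[start:end].mean(axis=0)   # the segment becomes a single unit
    return W @ seg_embedding                          # unnormalised phone scores

frames = rng.normal(size=(100, feat_dim))             # toy utterance of 100 feature frames
print(segment_scores(frames, 20, 27).shape)           # -> (40,)
```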

A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition

TLDR
This paper studies the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words.

Attention-Based Models for Speech Recognition

TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
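
A minimal numpy sketch of location-aware additive attention in the spirit of this TLDR: the attention energies depend on the decoder state, the encoder outputs, and features obtained by convolving the previous attention weights. All dimensions and parameter matrices are random toy values, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim, conv_ch, conv_width = 50, 8, 6, 10, 4, 5

h = rng.normal(size=(T, enc_dim))       # encoder outputs h_1 .. h_T
s = rng.normal(size=(dec_dim,))         # current decoder state
alpha_prev = np.full(T, 1.0 / T)        # previous attention weights (uniform at the start)

W = rng.normal(size=(att_dim, dec_dim))
V = rng.normal(size=(att_dim, enc_dim))
U = rng.normal(size=(att_dim, conv_ch))
F = rng.normal(size=(conv_ch, conv_width))  # 1-D filters applied to alpha_prev
w = rng.normal(size=(att_dim,))

# location features: convolve the previous attention weights with each filter
f = np.stack([np.convolve(alpha_prev, F[c], mode="same") for c in range(conv_ch)], axis=1)

# additive (Bahdanau-style) energies with the extra location term
e = np.tanh(h @ V.T + s @ W.T + f @ U.T) @ w      # shape (T,)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                               # new attention weights
context = alpha @ h                                # context vector fed to the decoder
print(context.shape, round(float(alpha.sum()), 6))
```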

GMM-Free Flat Start Sequence-Discriminative DNN Training

TLDR
The sequence-discriminative flat start training method is not only significantly faster than the straightforward approach of iterative retraining and realignment, but the word error rates attained are slightly better as well.
...