Publications
Improved training of end-to-end attention models for speech recognition
TLDR
This work introduces a new pretraining scheme that starts with a high time reduction factor and lowers it during training, which is crucial for both convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.
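As a rough illustration of the idea of lowering the encoder's time reduction factor during training, the sketch below schedules the downsampling factor by epoch and applies it as average pooling over input frames. The schedule values, function names, and shapes are illustrative placeholders, not the paper's or RETURNN's actual configuration.

```python
import numpy as np

# Hypothetical pretraining schedule: start with strong time downsampling,
# relax it as training progresses. Epochs and factors are illustrative only.
SCHEDULE = [(0, 32), (5, 16), (10, 8)]  # (start_epoch, time_reduction_factor)

def time_reduction_factor(epoch: int) -> int:
    """Return the downsampling factor to use at a given epoch."""
    factor = SCHEDULE[0][1]
    for start_epoch, f in SCHEDULE:
        if epoch >= start_epoch:
            factor = f
    return factor

def downsample(features: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (time, feature_dim) matrix along time by `factor`."""
    n_frames = (features.shape[0] // factor) * factor
    trimmed = features[:n_frames]
    return trimmed.reshape(-1, factor, features.shape[1]).mean(axis=1)

if __name__ == "__main__":
    feats = np.random.randn(1000, 40)  # e.g. 10 s of 40-dim log-mel frames
    for epoch in (0, 5, 10):
        f = time_reduction_factor(epoch)
        print(epoch, f, downsample(feats, f).shape)
```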
RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation
We present state-of-the-art automatic speech recognition (ASR) systems for the LibriSpeech task, comparing a standard hybrid DNN/HMM architecture to an attention-based encoder-decoder design.
Language Modeling with Deep Transformers
TLDR
Analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of positional information, and removing the positional encoding is found to even slightly improve the performance of these models.
On Using SpecAugment for End-to-End Speech Translation
TLDR
This work investigates SpecAugment, a simple data augmentation technique, for end-to-end speech translation, showing that it alleviates overfitting to some extent and leads to significant improvements across various data conditions, irrespective of the amount of training data.
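A minimal sketch of the SpecAugment-style masking referred to above: random frequency and time bands of a log-mel feature matrix are zeroed out. The mask widths and counts are arbitrary placeholders, not the settings used in the paper.

```python
import numpy as np

def spec_augment(features: np.ndarray,
                 max_freq_mask: int = 8,
                 max_time_mask: int = 20,
                 n_masks: int = 2,
                 rng=None) -> np.ndarray:
    """Apply simple SpecAugment-style masking to a (time, freq) matrix.

    Mask widths and counts are illustrative defaults, not the paper's values.
    """
    rng = rng or np.random.default_rng()
    out = features.copy()
    n_time, n_freq = out.shape
    for _ in range(n_masks):
        # Frequency mask: zero a random band of consecutive mel channels.
        f = rng.integers(0, max_freq_mask + 1)
        f0 = rng.integers(0, max(1, n_freq - f + 1))
        out[:, f0:f0 + f] = 0.0
        # Time mask: zero a random span of consecutive frames.
        t = rng.integers(0, max_time_mask + 1)
        t0 = rng.integers(0, max(1, n_time - t + 1))
        out[t0:t0 + t, :] = 0.0
    return out

if __name__ == "__main__":
    feats = np.random.randn(300, 80)  # 3 s of 80-dim log-mel features
    print(spec_augment(feats).shape)
```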
A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition
TLDR
A pretraining scheme for LSTMs with layer-wise construction of the network is introduced, showing good improvements especially for deep networks, and computation times are compared against recognition performance.
A Comparison of Transformer and LSTM Encoder Decoder Models for ASR
We present competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition, needing less training time compared to a similarly performing LSTM model.
Towards Online-Recognition with Deep Bidirectional LSTM Acoustic Models
TLDR
This work applies a modification to bidirectional RNNs to enable online recognition: a window is moved over the input stream, one forwarding through the RNN is performed on each window, and the posteriors of the individual forwardings are combined and renormalized.
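The windowed online-recognition scheme summarized above can be sketched roughly as follows: a bidirectional model is run on overlapping windows of the input stream, and the per-frame posteriors from overlapping forwardings are averaged and renormalized. The window and step sizes and the dummy forward function are placeholders, not the paper's configuration.

```python
import numpy as np

def forward_window(window: np.ndarray, n_classes: int = 10) -> np.ndarray:
    """Placeholder for one full forwarding of a bidirectional model on a
    window; returns per-frame posteriors of shape (len(window), n_classes)."""
    logits = np.random.randn(window.shape[0], n_classes)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def online_posteriors(stream: np.ndarray, window: int = 100, step: int = 50,
                      n_classes: int = 10) -> np.ndarray:
    """Slide a window over the input, forward each window, and combine the
    overlapping per-frame posteriors by averaging and renormalizing."""
    n_frames = stream.shape[0]
    acc = np.zeros((n_frames, n_classes))
    counts = np.zeros((n_frames, 1))
    for start in range(0, n_frames, step):
        end = min(start + window, n_frames)
        acc[start:end] += forward_window(stream[start:end], n_classes)
        counts[start:end] += 1.0
        if end == n_frames:
            break
    avg = acc / counts
    return avg / avg.sum(axis=1, keepdims=True)  # renormalize per frame

if __name__ == "__main__":
    print(online_posteriors(np.random.randn(230, 40)).shape)
```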
RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition
TLDR
It is shown that a layer-wise pretraining scheme for recurrent attention models gives over 1% absolute BLEU improvement and allows training deeper recurrent encoder networks.
The RWTH/UPB/FORTH System Combination for the 4th CHiME Challenge Evaluation
TLDR
This paper describes automatic speech recognition systems developed jointly by RWTH, UPB and FORTH for the 1ch, 2ch and 6ch tracks of the 4th CHiME Challenge and compares the ASR performance of different beamforming approaches.
Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models
TLDR
This work compares different approaches from the literature and proposes several novel methods to estimate the internal language model (ILM) directly from the attention-based encoder-decoder (AED) model, which outperform all previous approaches.
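As background for the ILM estimation discussed above, a common way to integrate an external LM with an AED model is to subtract an ILM score in a log-linear combination during decoding. The sketch below shows only that scoring step with illustrative weights; it is not any of the paper's proposed estimation methods.

```python
import math

def combined_score(log_p_aed: float, log_p_ext_lm: float, log_p_ilm: float,
                   lm_weight: float = 0.6, ilm_weight: float = 0.4) -> float:
    """Log-linear combination used in ILM-aware decoding:
    score = log p_AED(y|x) + lambda * log p_LM(y) - mu * log p_ILM(y).
    The weights here are illustrative, not tuned values from the paper."""
    return log_p_aed + lm_weight * log_p_ext_lm - ilm_weight * log_p_ilm

if __name__ == "__main__":
    # Toy example: a hypothesis scored by the AED model, an external LM,
    # and an estimated internal LM (all as log-probabilities).
    print(combined_score(math.log(0.2), math.log(0.05), math.log(0.1)))
```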
...