• Corpus ID: 239016056

Automatic Learning of Subword Dependent Model Scales

Felix Meyer, Wilfried Michel, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
To improve the performance of state-of-the-art automatic speech recognition systems, it is common practice to include external knowledge sources such as language models or prior corrections. This is usually done via log-linear model combination using separate scaling parameters for each model. Typically these parameters are manually optimized on some held-out data. In this work we propose to optimize these scaling parameters via automatic differentiation and stochastic gradient descent similar to…
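
As a rough illustration of the idea in the abstract, the following minimal sketch learns one global scale per model for a log-linear combination of hypothesis scores. It is not the authors' implementation: the toy scores, the function names (`combine`, `posterior`, `fit_scales`), the loss (negative log posterior of the reference hypothesis over a small hypothesis list), and the learning rate are all assumptions made for illustration, and the gradient is derived by hand rather than obtained by automatic differentiation. Subword-dependent scales, as named in the title, are not modeled here.

```python
import math

def combine(scores, scales):
    """Log-linear combination: scaled sum of per-model log-scores."""
    return sum(lam * s for lam, s in zip(scales, scores))

def posterior(hyps, scales):
    """Softmax over the combined scores of all competing hypotheses."""
    combined = [combine(h, scales) for h in hyps]
    top = max(combined)  # subtract the maximum for numerical stability
    exps = [math.exp(c - top) for c in combined]
    total = sum(exps)
    return [e / total for e in exps]

def fit_scales(data, scales, lr=0.1, epochs=200):
    """Per-utterance gradient descent on -log p(reference hypothesis)."""
    scales = list(scales)
    for _ in range(epochs):
        for hyps, ref in data:
            post = posterior(hyps, scales)
            for k in range(len(scales)):
                # d(-log p_ref)/d(scale_k) = E_post[score_k] - score_k(ref)
                expected = sum(p * h[k] for p, h in zip(post, hyps))
                scales[k] -= lr * (expected - hyps[ref][k])
    return scales

# Hypothetical toy data: each hypothesis is (acoustic log-score, LM log-score),
# paired with the index of the reference hypothesis.
data = [
    ([(-1.0, -5.0), (-1.2, -1.0)], 1),
    ([(-2.0, -1.5), (-2.1, -4.0)], 0),
]

# Start with the language model switched off (scale 0.0).
learned = fit_scales(data, scales=[1.0, 0.0])
```

In this toy setup the acoustic scores alone prefer the wrong hypothesis for the first utterance, so the optimization has to raise the language model scale to recover the reference.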

Early Stage LM Integration Using Local and Global Log-Linear Combination
This work presents a novel method for language model integration into implicit-alignment based sequence-to-sequence models with good improvements over standard model combination (shallow fusion) on the state-of-the-art Librispeech system.
Log-linear model combination with word-dependent scaling factors
This work combines three acoustic models, a pronunciation model, and a language model for a Mandarin BN/BC task and makes the scaling factor word- and pronunciation-dependent, which reduces the error rate by 2% relative.
A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition
This paper compares a suite of past methods and some of their own proposed methods for using unpaired text data to improve encoder-decoder models, and results confirm the benefits of using unpaired text across a range of methods and data sets.
Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models
This work considers two loss functions which approximate the expected number of word errors, either by sampling from the model or by using N-best lists of decoded hypotheses; the N-best approach is found to be more effective than the sampling-based method.
A Comparison of Transformer and LSTM Encoder Decoder Models for ASR
We present competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition, needing less training time than a similarly performing LSTM model.
Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition
This work proposes a novel and efficient minimum word error rate (MWER) training method for RNN-Transducer, which recalculates and sums the scores of all possible alignments for each hypothesis in the N-best lists and efficiently computes hypothesis probability scores and back-propagated gradients using the forward-backward algorithm.
On Using Monolingual Corpora in Neural Machine Translation
This work investigates how to leverage abundant monolingual corpora for neural machine translation to improve results for En-Fr and En-De translation and extends to high-resource language pairs such as Cs-En and De-En.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional
Librispeech: An ASR corpus based on public domain audio books
It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Neural Machine Translation by Jointly Learning to Align and Translate
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.