Automatic Learning of Subword Dependent Model Scales
@article{Meyer2021AutomaticLO,
  title   = {Automatic Learning of Subword Dependent Model Scales},
  author  = {Felix Meyer and Wilfried Michel and Mohammad Zeineldeen and Ralf Schl{\"u}ter and Hermann Ney},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2110.09324}
}
To improve the performance of state-of-the-art automatic speech recognition systems, it is common practice to include external knowledge sources such as language models or prior corrections. This is usually done via log-linear model combination, with a separate scaling parameter for each model. Typically, these parameters are manually optimized on some held-out data. In this work we propose to optimize the scaling parameters via automatic differentiation and stochastic gradient descent, similar to…
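The paper's own implementation is not shown here, but the core idea can be sketched in a few lines of PyTorch: combine frozen, precomputed ASR and LM output log-probabilities log-linearly with one learnable scale per subword, and fit those scales by stochastic gradient descent through the combined score. All names, sizes, and hyperparameters below are hypothetical.

```python
# Minimal sketch (not the authors' code): log-linear model combination with
# one learnable LM scale per subword, trained via automatic differentiation.
import torch

vocab_size = 10000  # hypothetical subword vocabulary size

# One scaling parameter per subword instead of a single global LM scale.
lm_scales = torch.nn.Parameter(torch.ones(vocab_size))
optimizer = torch.optim.SGD([lm_scales], lr=0.1)

def combined_log_probs(asr_log_probs, lm_log_probs):
    """Log-linear combination: score(v) = log p_ASR(v) + lambda_v * log p_LM(v).

    asr_log_probs, lm_log_probs: (batch, vocab) log-probabilities.
    Renormalize so the combined scores form a distribution again.
    """
    scores = asr_log_probs + lm_scales * lm_log_probs
    return torch.log_softmax(scores, dim=-1)

def train_step(asr_log_probs, lm_log_probs, targets):
    """One SGD step on held-out data: cross-entropy against reference labels."""
    log_probs = combined_log_probs(asr_log_probs, lm_log_probs)
    loss = torch.nn.functional.nll_loss(log_probs, targets)
    optimizer.zero_grad()
    loss.backward()  # gradients flow only into lm_scales; both models stay frozen
    optimizer.step()
    return loss.item()
```

In recognition, the learned per-subword scales would then simply take the place of the single hand-tuned global LM scale.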
References
Early Stage LM Integration Using Local and Global Log-Linear Combination
- Computer Science · INTERSPEECH
- 2020
This work presents a novel method for language model integration into implicit-alignment-based sequence-to-sequence models, with good improvements over standard model combination (shallow fusion) on a state-of-the-art LibriSpeech system.
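For contrast with the log-linear approaches above, the shallow-fusion baseline this entry mentions reduces to adding a single globally scaled LM term to the model score at each decoding step. A minimal sketch, with hypothetical names and an assumed typical scale value:

```python
import torch

def shallow_fusion_scores(am_log_probs: torch.Tensor,
                          lm_log_probs: torch.Tensor,
                          lm_scale: float = 0.3) -> torch.Tensor:
    """Shallow fusion: add a globally scaled external-LM log-probability
    to the model's output log-probability at each decoding step.
    lm_scale is a single hand-tuned constant, usually chosen on dev data."""
    return am_log_probs + lm_scale * lm_log_probs
```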
Log-linear model combination with word-dependent scaling factors
- Computer Science · INTERSPEECH
- 2009
This work combines three acoustic models, a pronunciation model, and a language model for a Mandarin BN/BC task and makes the scaling factors word- and pronunciation-dependent, which yields an additional error rate reduction of 2% relative.
A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition
- Computer Science · 2018 IEEE Spoken Language Technology Workshop (SLT)
- 2018
This paper compares a suite of past methods and some of their own proposed methods for using unpaired text data to improve encoder-decoder models, and results confirm the benefits of using unpaired text across a range of methods and data sets.
Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This work considers two loss functions that approximate the expected number of word errors: one estimates it by sampling from the model, the other uses N-best lists of decoded hypotheses; the N-best approach is found to be more effective than the sampling-based method.
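A minimal sketch of the N-best MWER approximation described here, assuming precomputed hypothesis scores and edit distances (names are hypothetical); subtracting the mean number of word errors over the list is the usual variance-reduction baseline:

```python
import torch

def mwer_loss(hyp_log_probs: torch.Tensor,
              hyp_word_errors: torch.Tensor) -> torch.Tensor:
    """Expected number of word errors over an N-best list (MWER, N-best variant).

    hyp_log_probs:   (N,) model log-scores of the N-best hypotheses
                     (gradients flow through these).
    hyp_word_errors: (N,) edit distances of each hypothesis to the reference.
    """
    probs = torch.softmax(hyp_log_probs, dim=0)  # renormalize over the N-best list
    errors = hyp_word_errors.float()
    errors = errors - errors.mean()              # subtract mean #errors as a baseline
    return (probs * errors).sum()
```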
A Comparison of Transformer and LSTM Encoder Decoder Models for ASR
- Computer Science · 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
We present competitive results using a Transformer encoder-decoder-attention model for end-to-end speech recognition needing less training time compared to a similarly performing LSTM model. We…
Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition
- Computer Science · INTERSPEECH
- 2020
This work proposes a novel and efficient minimum word error rate (MWER) training method for the RNN-Transducer, which re-calculates and sums the scores of all possible alignments for each hypothesis in the N-best list, and computes hypothesis probabilities and back-propagated gradients efficiently using the forward-backward algorithm.
On Using Monolingual Corpora in Neural Machine Translation
- Computer Science · ArXiv
- 2015
This work investigates how to leverage abundant monolingual corpora for neural machine translation, improving results for En-Fr and En-De translation, and extends the approach to high-resource language pairs such as Cs-En and De-En.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
- Computer Science · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional…
Librispeech: An ASR corpus based on public domain audio books
- Computer Science · 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2015
It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Neural Machine Translation by Jointly Learning to Align and Translate
- Computer Science · ICLR
- 2015
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend it by allowing the model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
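The (soft-)search this entry describes is additive attention; a compact sketch under assumed shapes, with hypothetical projection names:

```python
import torch

def additive_attention(decoder_state, encoder_states, W_q, W_k, v):
    """Bahdanau-style (soft-)search: score each source position against the
    current decoder state, then build a context vector as the weighted sum.

    decoder_state: (hidden,), encoder_states: (src_len, hidden)
    W_q, W_k: (hidden, attn_dim) projections, v: (attn_dim,)
    """
    energies = torch.tanh(decoder_state @ W_q + encoder_states @ W_k) @ v
    weights = torch.softmax(energies, dim=0)  # soft alignment over source positions
    context = weights @ encoder_states        # (hidden,) context vector
    return context, weights
```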