Deep Context: End-to-end Contextual Speech Recognition

@article{Pundak2018DeepCE,
  title={Deep Context: End-to-end Contextual Speech Recognition},
  author={Golan Pundak and Tara N. Sainath and Rohit Prabhavalkar and Anjuli Kannan and Ding Zhao},
  journal={2018 IEEE Spoken Language Technology Workshop (SLT)},
  year={2018},
  pages={418-425}
}
  • Published 7 August 2018
  • Computer Science, Engineering, Mathematics
In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases…
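The bias-attention step at the core of CLAS can be sketched as an additional attention over embeddings of the context phrases. Below is a minimal Python sketch; the randomly initialized `phrases` matrix stands in for the learned phrase encoder, and the names and dimensions are illustrative, not from the paper:

```python
import numpy as np

def bias_attention(decoder_state, phrase_embeddings):
    """Attend over embeddings of the context phrases and return a
    softmax-weighted context vector, as in the CLAS bias attention."""
    scores = phrase_embeddings @ decoder_state      # one score per phrase
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over phrases
    return weights @ phrase_embeddings, weights     # context vector, weights

rng = np.random.default_rng(0)
state = rng.standard_normal(4)                      # toy decoder state
phrases = rng.standard_normal((3, 4))               # 3 context phrases, dim 4
context_vec, attn = bias_attention(state, phrases)
```

In the full model the resulting context vector is combined with the audio attention context before the decoder predicts the next token.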
Context-Aware Transformer Transducer for Speech Recognition
TLDR
A novel context-aware transformer transducer (CATT) network is presented that improves a state-of-the-art transformer-based ASR system by taking advantage of contextual signals, exploring different techniques to encode contextual data and to create the final attention context vectors.
End-to-end Contextual Speech Recognition Using Class Language Models and a Token Passing Decoder
TLDR
This work proposes class-based language models (CLM) that can populate context-dependent information during inference for contextual speech recognition, together with a token-passing algorithm with efficient token recombination for E2E ASR.
Phoebe: Pronunciation-aware Contextualization for End-to-end Speech Recognition
TLDR
This work proposes an E2E model which benefits from the best of both worlds: it outputs graphemes, and thus learns to spell words directly, while leveraging pronunciations for words which might be likely in a given context.
Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR
TLDR
This work improves the CLAS approach by proposing several new strategies to extract embeddings for the contextual entities; comparing embedding extractors based on graphemic and phonetic input and/or output sequences shows that an encoder-decoder model trained jointly towards graphemes and phonemes outperforms the other approaches.
Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems
TLDR
This work proposes a contextual density-ratio approach for both training a context-aware E2E model and adapting the language model to named entities, and applies the technique to an E2E ASR system that transcribes doctor-patient conversations, better adapting the E2E system to the names in the conversations.
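The density-ratio adjustment summarized above can be written as score(y) = log P_E2E(y|x) + λ·(log P_context(y) − log P_source(y)): the E2E score is corrected by how much more likely a hypothesis is under a context LM than under the source-domain LM. A toy sketch, with illustrative probabilities and weight:

```python
import math

def density_ratio_score(e2e_logp, context_lm_logp, source_lm_logp, weight=0.4):
    """Density-ratio biasing: adjust the E2E log-probability by the
    weighted difference between a context LM and the source-domain LM."""
    return e2e_logp + weight * (context_lm_logp - source_lm_logp)

# Toy scores for a named entity the source-domain LM rarely sees.
score = density_ratio_score(e2e_logp=math.log(0.2),
                            context_lm_logp=math.log(0.5),
                            source_lm_logp=math.log(0.05))
```

Because the context LM assigns the entity a much higher probability than the source LM, the adjusted score exceeds the raw E2E score.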
Audio-Attention Discriminative Language Model for ASR Rescoring
  • Ankur Gandhe, A. Rastrow
  • Engineering, Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
It is shown that learning to rescore a list of potential ASR outputs is much simpler than learning to generate the hypothesis, and the proposed model results in up to 8% improvement in word error rate even when the amount of training data is a fraction of the data used for training the first-pass system.
Contextual Speech Recognition with Difficult Negative Training Examples
  • Uri Alon, G. Pundak, T. Sainath
  • Computer Science, Engineering
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
This work presents a novel and simple approach for training an ASR context mechanism with difficult negative examples that focuses on proper nouns in the reference transcript and uses phonetically similar phrases as negative examples, encouraging the neural model to learn more discriminative representations.
Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition
TLDR
This paper supplements an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly and is able to recognize more than 85% of newly added words that it previously failed to recognize compared to a strong baseline.
Shallow-Fusion End-to-End Contextual Biasing
TLDR
It is shown that the proposed approach to shallow-fusion-based biasing for end-to-end models obtains better performance than a state-of-the-art conventional model across a variety of tasks, the first time this has been demonstrated.
Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models
TLDR
An E2E model containing both English wordpieces and phonemes in the modeling space is proposed; the approach performs 16% better than a grapheme-only biasing model, and 8% better than a wordpiece-only biasing model, on a foreign place-name recognition task, with only slight degradation on regular English tasks.

References

Showing 1-10 of 31 references
Contextual Speech Recognition in End-to-end Neural Network Systems Using Beam Search
TLDR
A technique is introduced to adapt the inference process to take advantage of contextual signals by adjusting the output likelihoods of the neural network at each step in the beam search; it is effective at incorporating context into the predictions of an E2E system.
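The likelihood adjustment described in this TLDR can be sketched as boosting, at each beam-search step, the scores of tokens drawn from the contextual phrase set; the boost value and toy vocabulary below are illustrative assumptions, not values from the paper:

```python
def bias_step_scores(log_probs, context_tokens, boost=3.0):
    """Add a fixed bonus to the log-probabilities of tokens that appear
    in the contextual phrase set; other tokens are left unchanged."""
    return {tok: lp + (boost if tok in context_tokens else 0.0)
            for tok, lp in log_probs.items()}

# Toy next-token scores: without biasing, the contact name "joan"
# loses to the acoustically similar but more frequent "john".
step = {"call": -0.5, "joan": -3.0, "john": -1.0}
biased = bias_step_scores(step, context_tokens={"joan"})
```

After biasing, the contextually relevant token outranks its confusable competitor at this step.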
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, +11 authors M. Bacchiani
  • Computer Science, Engineering
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
Bringing contextual information to Google speech recognition
TLDR
This paper utilizes an on-the-fly rescoring mechanism to adjust the LM weights of a small set of n-grams relevant to the particular context during speech decoding, which handles out-of-vocabulary words.
End-to-end attention-based large vocabulary speech recognition
TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Composition-based on-the-fly rescoring for salient n-gram biasing
TLDR
A technique for dynamically applying contextually-derived language models to a state-of-the-art speech recognition system is presented, along with a construction algorithm which takes a trie representing the contextual n-grams and produces a weighted finite state automaton which is more compact than a standard n-gram machine.
An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model
TLDR
This work demonstrates that the use of shallow fusion with a neural LM with wordpieces yields a 9.1% relative word error rate reduction over the authors' competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring on Google Voice Search.
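Shallow fusion combines the sequence-to-sequence model's per-token score with a weighted external-LM score at each decoding step. A minimal sketch over a toy two-word vocabulary; the weight 0.3 and the distributions are illustrative assumptions:

```python
import math

def shallow_fusion_score(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-token log-probabilities from the ASR model and an
    external LM: score(y) = log P_asr(y) + lm_weight * log P_lm(y)."""
    return {tok: asr_log_probs[tok] + lm_weight * lm_log_probs[tok]
            for tok in asr_log_probs}

# Toy next-token distributions (log domain) over a tiny vocabulary.
asr = {"cat": math.log(0.6), "cap": math.log(0.4)}
lm  = {"cat": math.log(0.9), "cap": math.log(0.1)}

fused = shallow_fusion_score(asr, lm)
best = max(fused, key=fused.get)
```

The LM weight trades off acoustic evidence against the external LM; setting it to zero recovers the plain ASR scores.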
Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer
TLDR
This work investigates training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T) and finds that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors.
Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
TLDR
It is shown that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model removing the need to decode.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
Listen, Attend and Spell
TLDR
A neural network that learns to transcribe speech utterances to characters without making any independence assumptions between the characters, which is the key improvement of LAS over previous end-to-end CTC models.