Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR

@inproceedings{Chen2019JointGA,
  title={Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR},
  author={Zhehuai Chen and Mahaveer Jain and Yongqiang Wang and Michael L. Seltzer and Christian Fuegen},
  booktitle={INTERSPEECH},
  year={2019}
}
End-to-end approaches to automatic speech recognition, such as Listen-Attend-Spell (LAS), blend all components of a traditional speech recognizer into a unified model. Although this simplifies training and decoding pipelines, a unified model is hard to adapt when a mismatch exists between training and test data, especially if this information is dynamically changing. The Contextual LAS (CLAS) framework tries to solve this problem by encoding contextual entities into fixed-dimensional embeddings…
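The biasing mechanism the abstract describes, attending over fixed-dimensional embeddings of contextual entities at each decoder step, can be sketched roughly as follows. This is a minimal, hypothetical illustration with toy vectors and plain dot-product attention; the function names and dimensions are assumptions for exposition, not taken from the paper:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def bias_attention(decoder_state, context_embeddings):
    """CLAS-style biasing sketch: score each contextual-entity embedding
    against the current decoder state, then return the attention-weighted
    context vector that would be fed back into the decoder."""
    scores = [sum(d * e for d, e in zip(decoder_state, emb))
              for emb in context_embeddings]
    weights = softmax(scores)
    dim = len(decoder_state)
    return [sum(w * emb[i] for w, emb in zip(weights, context_embeddings))
            for i in range(dim)]
```

In the real model the scores come from learned projections rather than a raw dot product, but the shape of the computation is the same: a soft lookup over the context list.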
Context-Aware Transformer Transducer for Speech Recognition
TLDR
A novel context-aware transformer transducer (CATT) network is presented that improves state-of-the-art transformer-based ASR systems by taking advantage of contextual signals, exploring different techniques to encode contextual data and to create the final attention context vectors.
Deep Shallow Fusion for RNN-T Personalization
TLDR
This work presents novel techniques to improve RNN-T’s ability to model rare WordPieces, infuse extra information into the encoder, enable the use of alternative graphemic pronunciations, and perform deep fusion with personalized language models for more robust biasing.
Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition
  • Guangzhi Sun, Chao Zhang, P. Woodland
  • Computer Science
  • ArXiv
  • 2021
TLDR
A novel tree-constrained pointer generator (TCPGen) component is proposed that incorporates knowledge such as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models in a neural-symbolic way.
Neural Lattice Search for Speech Recognition
  • Rao Ma, H. Li, Qi Liu, Lu Chen, Kai Yu
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
This model is composed of a bidirectional LatticeLSTM encoder followed by an attentional LSTM decoder that generates the single best hypothesis from the given lattice space, and it yields 9.7% and 7.5% relative WER reduction compared to N-best rescoring and lattice rescoring methods within the same amount of decoding time.
Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model
TLDR
This paper uses an attention-based method to extract contextual vector representations of video metadata, uses these representations as part of the inputs to a neural language model during lattice rescoring, and proposes a hybrid pointer network approach to explicitly interpolate the probabilities of words that occur in the metadata.
Recent Advances in End-to-End Automatic Speech Recognition
  • Jinyu Li
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR
This paper overviews the recent advances in E2E models, focusing on technologies addressing those challenges from the industry’s perspective.
Parallelizing Adam Optimizer with Blockwise Model-Update Filtering
  • Kai Chen, Haisong Ding, Qiang Huo
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
Experimental results show that BMUF-Adam achieves an almost linear speedup without recognition accuracy degradation and outperforms the SSG-based method in terms of speedup, scalability and recognition accuracy.
Cif-Based Collaborative Decoding for End-to-End Contextual Speech Recognition
  • M. Han, Linhao Dong, Shiyu Zhou, Bo Xu
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
This paper focuses on incorporating contextual information into the continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable fashion and introduces an extra context processing network to extract contextual embeddings, integrate acoustically relevant contextual information and decode the contextual output distribution.
Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion
TLDR
This work proposes a novel solution that combines shallow fusion, trie-based deep biasing, and neural network language model contextualization that results in significant relative Word Error Rate improvement over existing contextual biasing approaches and 5.4%–9.3% improvement compared to a strong hybrid baseline.
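The trie-based biasing idea can be illustrated with a toy prefix trie over the token sequences of the biasing phrases: during decoding, only tokens that extend a live prefix in the trie are candidates for the biasing bonus. A hypothetical sketch (whitespace tokenization and a dict-based trie are assumptions for illustration, not the paper's implementation):

```python
def build_biasing_trie(phrases):
    """Build a prefix trie over the token sequences of biasing phrases.
    A nested dict serves as the trie; "<end>" marks a complete phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node["<end>"] = {}
    return root

def valid_next_tokens(trie, prefix):
    """Return the tokens that extend `prefix` inside some biasing phrase;
    an empty list means the prefix left the trie and biasing stops."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return [t for t in node if t != "<end>"]
```

For example, with phrases "san fran cisco" and "san jose", after consuming "san" the decoder would bias only "fran" and "jose".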
Contextual RNN-T For Open Domain ASR
TLDR
Modifications to the RNN-T model are proposed that allow the model to utilize additional metadata text with the objective of improving performance on Named Entities (WER-NE) for videos with related metadata.

References

Showing 1–10 of 28 references
Deep Context: End-to-end Contextual Speech Recognition
TLDR
This work presents a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context and jointly optimizes the ASR components along with embeddings of the context n-grams.
Phoebe: Pronunciation-aware Contextualization for End-to-end Speech Recognition
TLDR
This work proposes an E2E model which benefits from the best of both worlds: it outputs graphemes, and thus learns to spell words directly, while leveraging pronunciations for words which might be likely in a given context.
End-to-end Contextual Speech Recognition Using Class Language Models and a Token Passing Decoder
TLDR
This work proposes to use class-based language models (CLM) that can populate context-dependent information during inference for contextual speech recognition, and proposes a token passing algorithm with an efficient token recombination for E2E ASR.
End-to-end attention-based large vocabulary speech recognition
TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Contextual Speech Recognition in End-to-end Neural Network Systems Using Beam Search
TLDR
A technique is introduced that adapts the inference process to take advantage of contextual signals by adjusting the output likelihoods of the neural network at each step in the beam search, and it proves effective at incorporating context into the prediction of an E2E system.
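Adjusting the output likelihoods at each beam-search step is commonly realized as log-linear interpolation with an external (contextual) language model, i.e. shallow fusion. A minimal sketch, assuming per-token log-probabilities from both models and an assumed fusion weight (the floor value for tokens the LM has not scored is also an assumption):

```python
import math

def shallow_fusion_step(asr_log_probs, lm_log_probs, lam=0.3):
    """Rescore one beam-search step: interpolate the E2E model's output
    log-likelihoods with an external LM's log-probabilities.
    `lam` is the fusion weight, tuned on held-out data in practice."""
    floor = math.log(1e-10)  # score for tokens absent from the LM
    return {tok: asr_log_probs[tok] + lam * lm_log_probs.get(tok, floor)
            for tok in asr_log_probs}
```

With a large enough weight, a contextually likely token can overtake the acoustically preferred one, which is exactly the biasing effect described above.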
Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition
TLDR
A joint word-character A2W model is presented that learns to first spell the word and then recognize it, providing a rich output to the user instead of simple word hypotheses, which makes it especially useful for words unseen or rarely seen during training.
Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks
TLDR
This work proposes a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN) that has the flexibility of taking into consideration the full context of graphemes and transforms the problem from a series of grapheme-to-phoneme conversions to a word-to-pronunciation conversion.
End-to-End Speech Recognition Models
TLDR
This thesis proposes a novel approach to ASR with neural attention models and demonstrates the end-to-end speech recognition model, which can directly emit English/Chinese characters or even word pieces given the audio signal.
On Modular Training of Neural Acoustics-to-Word Model for LVCSR
TLDR
A novel modular training framework for E2E ASR is proposed to separately train neural acoustic and language models during the training stage, while still performing end-to-end inference in the decoding stage.
Joint-sequence models for grapheme-to-phoneme conversion
TLDR
A novel estimation algorithm is presented that demonstrates high accuracy on a variety of databases and studies the impact of the maximum approximation in training and transcription, the interaction of model size parameters, n-best list generation, confidence measures, and phoneme-to-grapheme conversion.