Fast Contextual Adaptation with Neural Associative Memory for On-Device Personalized Speech Recognition

  • Tsendsuren Munkhdalai, Khe Chai Sim, A. N. Chandorkar, Fan Gao, Mason Chua, Trevor Strohman, Françoise Beaufays
  • Published 5 October 2021
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Fast contextual adaptation has been shown to be effective at improving Automatic Speech Recognition (ASR) of rare words, and when combined with on-device personalized training it can yield even better recognition results. However, traditional re-scoring approaches based on an external language model are prone to diverging during personalized training. In this work, we introduce a model-based end-to-end contextual adaptation approach that is decoder-agnostic and amenable to on-device… 
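The neural associative memory in the abstract can be understood as attention over a key-value store of context-phrase embeddings: a decoder-state query retrieves a softmax-weighted mixture of stored bias vectors. A minimal numpy sketch, with all names, shapes, and the temperature parameter being illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def associative_memory_lookup(query, keys, values, temperature=1.0):
    """Attend over a key-value memory of context-phrase embeddings.

    query:  (d,)    decoder-state embedding (hypothetical)
    keys:   (n, d)  one embedding per context phrase
    values: (n, d)  bias vectors associated with each phrase
    Returns a softmax-weighted combination of the stored values.
    """
    scores = keys @ query / temperature
    scores -= scores.max()            # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ values

# Toy memory holding two context phrases; the query is close to the first key,
# so the lookup returns (almost) the first stored value.
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0]])
out = associative_memory_lookup(np.array([5.0, 0.0]), keys, values, temperature=0.5)
```

Because retrieval is differentiable, such a memory can be trained end-to-end with the recognizer, which is what makes it decoder-agnostic compared with external-LM re-scoring.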


Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition

This work investigates leveraging external knowledge, in particular off-policy key-value stores generated with text-to-speech methods, to allow post-training adaptation to new data distributions, helping production ASR systems in challenging zero- and few-shot scenarios.

On-the-fly ASR Corrections with Audio Exemplars

This work proposes to directly compare incoming audio embeddings against a list of Audio Exemplars (AE), each associated with a text correction, and demonstrates the effectiveness of this approach by correcting the outputs of a production-quality RNNT model.

CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals

A modular framework that allows incremental, scalable training of context-enhanced LMs, and can swap one type of pretrained sentence LM for another without retraining the context encoders, by only adapting the decoder model.

Contextual Speech Recognition with Difficult Negative Training Examples

This work presents a novel and simple approach for training an ASR context mechanism with difficult negative examples that focuses on proper nouns in the reference transcript and uses phonetically similar phrases as negative examples, encouraging the neural model to learn more discriminative representations.

Robust Continuous On-Device Personalization for Automatic Speech Recognition

It is found that quantizing and dequantizing the model weights in between training rounds can prevent the model from learning effectively, but this issue can be circumvented by adding noise to the quantized weights at the start of each training round.
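The trick described above can be sketched as follows: simulate round-trip quantization of the weights, then dither the quantized values with noise of up to half a quantization step before the next training round. This is an illustrative stand-in under assumed 8-bit uniform quantization, not the paper's exact scheme:

```python
import numpy as np

def fake_quantize(w, num_bits=8, w_min=-1.0, w_max=1.0):
    """Uniformly quantize then dequantize weights (simulated low-bit storage)."""
    levels = 2 ** num_bits - 1
    step = (w_max - w_min) / levels
    q = np.round((np.clip(w, w_min, w_max) - w_min) / step)
    return q * step + w_min

def add_dither(w_q, num_bits=8, w_min=-1.0, w_max=1.0, rng=None):
    """Add uniform noise within half a quantization step, so repeated
    quantize/dequantize rounds do not keep snapping updates back to the
    same grid points (illustrative parameters)."""
    rng = rng or np.random.default_rng(0)
    step = (w_max - w_min) / (2 ** num_bits - 1)
    return w_q + rng.uniform(-step / 2, step / 2, size=w_q.shape)

w = np.array([0.1234, -0.5678])
w_q = fake_quantize(w)       # what the device would store between rounds
w_noisy = add_dither(w_q)    # what the next training round would start from
```

Without the dither, small gradient updates can be erased entirely when the weights are re-quantized at the end of a round, which matches the learning stall reported in the summary.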

Contextual RNN-T For Open Domain ASR

Modifications to the RNN-T model are proposed that allow the model to utilize additional metadata text with the objective of improving performance on Named Entities (WER-NE) for videos with related metadata.

Personalized speech recognition on mobile devices

We describe a large vocabulary speech recognition system that is accurate, has low latency, and has a memory and computational footprint small enough to run faster than real time on a Nexus 5.

Deep Context: End-to-end Contextual Speech Recognition

This work presents a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context, and jointly-optimizes the ASR components along with embeddings of the context n-grams.

Personalization of End-to-End Speech Recognition on Mobile Devices for Named Entities

This work evaluates the effectiveness of several techniques to personalize end-to-end speech models and improve the recognition of proper names relevant to the user, and proposes using keyword-dependent precision and recall metrics to measure vocabulary acquisition performance.

Conformer: Convolution-augmented Transformer for Speech Recognition

This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies.

Speech recognition with deep recurrent neural networks

This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.

Sparse Meta Networks for Sequential Adaptation and its Application to Adaptive Language Modelling

This work augments a deep neural network with a layer-specific fast-weight memory, generated sparsely at each time step and accumulated incrementally through time providing a useful inductive bias for online continual adaptation.
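A fast-weight memory of the kind summarized above is often implemented as a decayed running sum of key-value outer products, read out by a matrix-vector product. A minimal sketch of that mechanism (decay rate, shapes, and the dense rather than sparse update are assumptions for illustration):

```python
import numpy as np

def fast_weight_step(M, key, value, decay=0.9):
    """One incremental update: decay the old memory, write a new association
    as the outer product key @ value^T."""
    return decay * M + np.outer(key, value)

def fast_weight_read(M, query):
    """Retrieve the value associated with a query key."""
    return M.T @ query

# Write one association, then read it back with the matching key.
M = np.zeros((2, 2))
M = fast_weight_step(M, key=np.array([1.0, 0.0]), value=np.array([0.0, 1.0]))
read = fast_weight_read(M, np.array([1.0, 0.0]))
```

Accumulating the memory through time is what provides the inductive bias for online continual adaptation: recent associations dominate while older ones fade with the decay factor.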

Attention is All you Need

A simple new network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.