Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings
  • Prakhar Swarup, Roland Maas, Srinivas Garimella, Sri Harish Reddy Mallidi, Björn Hoffmeister
In automatic speech recognition (ASR), confidence measures provide a quantitative representation of the reliability of the generated hypothesis text. For personal-assistant devices like Alexa, speech recognition errors are inevitable given the growing number of applications. Confidence scores therefore provide an important metric for downstream consumers to gauge the correctness of the ASR hypothesis text and to initiate appropriate actions. In this work, our aim is to improve the…
An Evaluation of Word-Level Confidence Estimation for End-to-End Automatic Speech Recognition
This paper provides an extensive benchmark of popular confidence methods on four well-known speech datasets, and suggests that a strong baseline can be obtained by scaling the logits by a learnt temperature, estimating the confidence as the negative entropy of the predictive distribution, and sum-pooling to aggregate at the word level.
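A minimal sketch of that baseline, assuming logits are available per subword token; the temperature value, function names, and vocabulary size are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def token_confidence(logits, temperature=1.5):
    """Negative entropy of the temperature-scaled predictive distribution.
    Higher (closer to zero) means a more peaked, more confident prediction."""
    probs = softmax(logits / temperature)
    return np.sum(probs * np.log(probs + 1e-12), axis=-1)

def word_confidence(token_confidences):
    """Sum pooling over the subword tokens that make up one word."""
    return float(np.sum(token_confidences))

# Example: two subword tokens over a 4-symbol vocabulary
logits = np.array([[4.0, 1.0, 0.5, 0.2],
                   [3.0, 2.5, 0.3, 0.1]])
conf = word_confidence(token_confidence(logits))
```

In practice the temperature would be learnt on held-out data rather than fixed.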
Confidence Measures in Encoder-Decoder Models for Speech Recognition
This work presents a novel method which uses internal neural features of a frozen ASR model to train an independent neural network to predict a softmax temperature value, computed in each decoder time step and multiplied by the logits in order to redistribute the output probabilities.
Learning Word-Level Confidence for Subword End-To-End ASR
  • David Qiu, Qiujia Li, +9 authors Ian McGraw
  • Engineering, Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
Two confidence models of increasing complexity are proposed to solve the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR); the final model uses self-attention to directly learn word-level confidence without needing subword tokenization.
Word Error Rate Estimation Without ASR Output: e-WER2
A novel approach to estimating WER using a multistream end-to-end architecture: the no-box system learns a joint acoustic-lexical representation from phoneme recognition results together with MFCC acoustic features to estimate WER without access to the ASR output.
ASR Rescoring and Confidence Estimation with ELECTRA
This work proposes an ASR rescoring method for directly detecting errors with ELECTRA, which is originally a pre-training method for NLP tasks, and an extended version of ELECTRA called phone-attentive ELECTRA (P-ELECTRA), which performs better in confidence estimation than BERT.
Efficient Large Scale Semi-Supervised Learning for CTC Based Acoustic Models
This paper presents the largest ASR SSL experiment conducted to date, in which 75K hours of labeled and 1.2 million hours of unlabeled data are used for model training, and introduces a couple of novel techniques to facilitate such a large-scale experiment.
Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models
This paper proposes a novel teacher-student knowledge distillation (KD) approach for sequence discriminative training, where reference state sequences of unlabeled data are estimated using a strong bidirectional LSTM teacher model, which is then used to guide the sMBR training of an LSTM student model.
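A minimal sketch of the distillation idea, shown as a per-frame cross-entropy of the student against softened teacher posteriors; the temperature and function names are illustrative and this omits the sequence-level sMBR criterion the paper actually uses:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean per-frame cross-entropy of the student distribution
    against the temperature-softened teacher posteriors."""
    t = softmax(teacher_logits / temperature)
    log_s = np.log(softmax(student_logits / temperature) + 1e-12)
    return float(-(t * log_s).sum(axis=-1).mean())
```

By Gibbs' inequality the loss is minimized when the student matches the teacher distribution exactly.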
Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios
The proposed neural confidence measure (NCM) is trained as a binary classification task to accept or reject an end-to-end speech recognition result, and incorporates features from an encoder, a decoder, and an attention block of the attention-based end-to-end speech recognition model to significantly improve the NCM.
Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection
  • Jing Du, Shiliang Pu, +5 authors Hongwei Zhou
  • Computer Science, Engineering
  • 2022
A cross-modal post-processing system for speech recognizers is proposed, which fuses acoustic features and textual features from different modalities, jointly trains a confidence estimator and an error corrector in a multi-task learning fashion, and unifies the error correction and utterance rejection modules.
Knowledge Distillation and Data Selection for Semi-Supervised Learning in CTC Acoustic Models.
The current study proposes a methodology for integrating two key ideas: 1) SSL using the connectionist temporal classification (CTC) objective and teacher-student learning, and 2) designing effective data-selection mechanisms for leveraging unlabeled data to boost the performance of student models.
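One simple form such a data-selection mechanism can take is keeping only unlabeled utterances whose teacher-assigned token confidences average above a threshold; the rule, names, and threshold below are an illustrative sketch, not the paper's actual criterion:

```python
def select_utterances(scored_utterances, threshold=0.9):
    """Keep unlabeled utterances whose teacher-assigned token
    confidences average above the threshold."""
    return [utt for utt, confs in scored_utterances
            if sum(confs) / len(confs) >= threshold]

# Example: each entry pairs an utterance with its per-token confidences
data = [("turn on the lights", [0.95, 0.97, 0.96, 0.94]),
        ("noisy clip", [0.40, 0.60])]
kept = select_utterances(data)
```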
Combining Acoustic Embeddings and Decoding Features for End-of-Utterance Detection in Real-Time Far-Field Speech Recognition Systems
Proposes an end-of-utterance detector for real-time automatic speech recognition in far-field scenarios and shows the benefit of ASR decoder features, especially as a low-cost alternative to ASR hypothesis embeddings.
Finding consensus in speech recognition: word error minimization and other applications of confusion networks
We describe a new framework for distilling information from word lattices to improve the accuracy of the speech recognition output and obtain a more perspicuous representation of a set of alternative hypotheses.
Using word probabilities as confidence measures
  • F. Wessel, Klaus Macherey, R. Schlüter
  • Computer Science
    Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
  • 1998
An approach to estimating the confidence in a hypothesized word as its posterior probability, given all acoustic feature vectors of the speaker utterance, computed as the sum of the probabilities of all word hypotheses that represent the occurrence of the same word in more or less the same segment of time.
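A minimal sketch of that posterior-summing idea over a flattened lattice, where each hypothesis is a (word, start, end, probability) tuple; the overlap criterion and all names are illustrative simplifications of the paper's time-alignment rule:

```python
def word_posterior(hypotheses, word, start, end, min_overlap=0.5):
    """Sum the probabilities of all lattice word hypotheses that carry
    the same word in (roughly) the same time segment."""
    total = 0.0
    for w, s, e, p in hypotheses:
        if w != word:
            continue
        # fraction of the longer segment covered by the time overlap
        overlap = max(0.0, min(end, e) - max(start, s))
        if overlap / max(end - start, e - s) >= min_overlap:
            total += p
    return total
```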
Confidence measures for spoken dialogue systems
Provides improved confidence assessment for detecting word-level speech recognition errors, out-of-domain utterances, and incorrect concepts in the CU Communicator system, and considers a neural network to combine all features at each level.
Device-directed Utterance Detection
In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include rejection of false…
A probabilistic approach to confidence estimation and evaluation
A novel way of estimating confidences for words that are recognized by a speech recognition system is proposed, which makes use of generalized linear models as a means for combining various predictor scores so as to arrive at confidence estimates.
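A generalized linear model of this kind can be sketched as a logistic combination of predictor scores; the weights, bias, and function names below are illustrative, not fitted values from the paper:

```python
import math

def glm_confidence(scores, weights, bias=0.0):
    """Combine predictor scores through a logistic link into a
    single confidence estimate in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))

# Example: two predictor scores (e.g. acoustic and language-model based)
conf = glm_confidence([0.8, 0.2], weights=[2.0, 1.0], bias=-1.0)
```

In practice the weights would be trained on labeled correct/incorrect word examples.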
Large vocabulary decoding and confidence estimation using word posterior probabilities
  • Gunnar Evermann, P. Woodland
  • Computer Science
    2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100)
  • 2000
The paper investigates the estimation of word posterior probabilities based on word lattices and presents applications of these posteriors in a large vocabulary speech recognition system. A novel…
Different confidence measures for word verification in speech recognition
Experimental results presented in this paper show that the proposed verification method improves the performance of KWS systems by reducing the false alarm rate without a significant increase in the rejection of correctly detected keywords.
A training procedure for verifying string hypotheses in continuous speech recognition
A discriminative training procedure is proposed for verifying the occurrence of string hypotheses produced by a hidden Markov model (HMM) based continuous speech recognizer to increase the power of a hypothesis test for utterance verification.
GloVe: Global Vectors for Word Representation
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
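The GloVe objective is a weighted least-squares fit of word-vector dot products to log co-occurrence counts; the sketch below evaluates that loss for given parameters (names and the looped implementation are illustrative, and training the vectors is omitted):

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective over nonzero co-occurrence
    counts X: sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    loss = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        # weighting function f caps the influence of very frequent pairs
        f = (X[i, j] / x_max) ** alpha if X[i, j] < x_max else 1.0
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        loss += f * diff ** 2
    return loss
```

A perfect fit (dot product plus biases equal to the log count everywhere) drives the loss to zero.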