Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings

@inproceedings{Swarup2019ImprovingAC,
  title={Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings},
  author={Prakhar Swarup and Roland Maas and Srinivas Garimella and Sri Harish Reddy Mallidi and Bj{\"o}rn Hoffmeister},
  booktitle={INTERSPEECH},
  year={2019}
}
In automatic speech recognition, confidence measures provide a quantitative representation used to assess the reliability of generated hypothesis text. For personal assistant devices like Alexa, speech recognition errors are inevitable due to the growing number of applications. Hence, confidence scores give downstream consumers an important metric for gauging the correctness of ASR hypothesis text and for initiating appropriate actions. In this work, our aim is to improve the…

Citations
An Evaluation of Word-Level Confidence Estimation for End-to-End Automatic Speech Recognition
This paper provides an extensive benchmark of popular confidence methods on four well-known speech datasets, and suggests that a strong baseline can be obtained by scaling the logits by a learnt temperature, estimating confidence as the negative entropy of the predictive distribution, and sum-pooling to aggregate at the word level.
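The baseline summarized above can be sketched concretely: scale the logits by a learned temperature, take the negative entropy of the resulting distribution as a per-token confidence, and sum-pool token scores to the word level. This is an illustrative sketch, not the paper's implementation; the function names and the temperature value are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_confidence(logits, temperature=1.5):
    # Scale logits by a learned temperature (value here is hypothetical),
    # then use the negative entropy of the predictive distribution
    # as the confidence score: peaked distribution -> score near 0,
    # flat distribution -> more negative score.
    probs = softmax([x / temperature for x in logits])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return -entropy

def word_confidence(per_token_logits, temperature=1.5):
    # Sum pooling: aggregate subword-token confidences to word level.
    return sum(token_confidence(l, temperature) for l in per_token_logits)
```

For example, a sharply peaked token distribution yields a higher (less negative) confidence than a uniform one, which is the behavior the benchmark exploits.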
Confidence Measures in Encoder-Decoder Models for Speech Recognition
This work presents a novel method which uses internal neural features of a frozen ASR model to train an independent neural network to predict a softmax temperature value, computed in each decoder time step and multiplied by the logits in order to redistribute the output probabilities.
Learning Word-Level Confidence for Subword End-To-End ASR
  • David Qiu, Qiujia Li, +9 authors Ian McGraw
  • ICASSP 2021
Two confidence models of increasing complexity are proposed to solve the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR); the final model uses self-attention to directly learn word-level confidence without needing subword tokenization.
Word Error Rate Estimation Without ASR Output: e-WER2
A novel multistream end-to-end approach to word error rate (WER) estimation: the no-box system learns a joint acoustic-lexical representation from phoneme recognition results together with MFCC acoustic features to estimate the WER.
ASR Rescoring and Confidence Estimation with ELECTRA
This work proposes an ASR rescoring method for directly detecting errors with ELECTRA, originally a pre-training method for NLP tasks, and an extended version of ELECTRA called phone-attentive ELECTRA (P-ELECTRA), which performs better in confidence estimation than BERT.
Efficient Large Scale Semi-Supervised Learning for CTC Based Acoustic Models
This paper presents the largest ASR SSL experiment conducted to date, in which 75K hours of labeled and 1.2 million hours of unlabeled data are used for model training, and introduces a couple of novel techniques to facilitate such a large-scale experiment.
Leveraging Unlabeled Speech for Sequence Discriminative Training of Acoustic Models
This paper proposes a novel Teacher-Student knowledge distillation (KD) approach for sequence discriminative training, where reference state sequences for unlabeled data are estimated using a strong bi-directional LSTM Teacher model, which then guides the sMBR training of an LSTM Student model.
Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios
The proposed neural confidence measure (NCM) is trained as a binary classification task to accept or reject an end-to-end speech recognition result, and incorporates features from the encoder, decoder, and attention block of the attention-based end-to-end speech recognition model to improve the NCM significantly.
Knowledge Distillation and Data Selection for Semi-Supervised Learning in CTC Acoustic Models.
The study proposes a methodology integrating two key ideas: 1) SSL using the connectionist temporal classification (CTC) objective and teacher-student based learning, and 2) designing effective data-selection mechanisms for leveraging unlabeled data to boost the performance of student models.
Knowledge-based Conversational Search
This thesis lays foundations for designing conversational search systems by analyzing the requirements and proposing concrete solutions for automating some of the basic components and tasks that such systems should support.

References

Showing 1-10 of 20 references
Combining Acoustic Embeddings and Decoding Features for End-of-Utterance Detection in Real-Time Far-Field Speech Recognition Systems
An end-of-utterance detector for real-time automatic speech recognition in far-field scenarios; results show the benefit of ASR decoder features, especially as a low-cost alternative to ASR hypothesis embeddings.
Finding consensus in speech recognition: word error minimization and other applications of confusion networks
We describe a new framework for distilling information from word lattices to improve the accuracy of the speech recognition output and obtain a more perspicuous representation of a set of alternative…
Using word probabilities as confidence measures
  • F. Wessel, Klaus Macherey, R. Schlüter
  • ICASSP 1998
An approach that estimates the confidence in a hypothesized word as its posterior probability, given all acoustic feature vectors of the spoken utterance, computed as the sum of the probabilities of all word hypotheses representing the occurrence of the same word in more or less the same segment of time.
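A minimal sketch of this idea, assuming a simplified lattice representation in which each hypothesis carries a posterior probability and time-stamped words: the confidence of a hypothesized word is the summed posterior of all hypotheses containing the same word in roughly the same time segment. All names, the data layout, and the overlap threshold are hypothetical illustrations, not the paper's formulation.

```python
def word_posterior(word, start, end, hypotheses, min_overlap=0.5):
    # hypotheses: list of (posterior, [(word, start, end), ...]) tuples,
    # each a full hypothesis from the recognizer with its posterior mass.
    # Sum the posteriors of all hypotheses containing `word` in
    # (more or less) the same time segment, judged by relative overlap.
    total = 0.0
    for posterior, words in hypotheses:
        for w, s, e in words:
            if w != word:
                continue
            overlap = min(end, e) - max(start, s)
            shorter = min(end - start, e - s)
            if shorter > 0 and overlap / shorter >= min_overlap:
                total += posterior
                break  # count each hypothesis at most once
    return total
```

With posteriors 0.6 and 0.3 on two hypotheses that both contain "hello" near the same time span, the word's confidence is their sum, 0.9, while a competing word keeps only its own mass.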
Confidence measures for spoken dialogue systems
Provides improved confidence assessment for detecting word-level speech recognition errors, out-of-domain utterances, and incorrect concepts in the CU Communicator system, and considers a neural network to combine all features at each level.
Device-directed Utterance Detection
In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include rejection of false…
A probabilistic approach to confidence estimation and evaluation
A novel way of estimating confidences for words that are recognized by a speech recognition system is proposed, which makes use of generalized linear models as a means for combining various predictor scores so as to arrive at confidence estimates.
Large vocabulary decoding and confidence estimation using word posterior probabilities
  • Gunnar Evermann, P. Woodland
  • ICASSP 2000
The paper investigates the estimation of word posterior probabilities based on word lattices and presents applications of these posteriors in a large vocabulary speech recognition system. A novel…
Different confidence measures for word verification in speech recognition
Experimental results presented in this paper show that the proposed verification method improves the performance of KWS systems by reducing the false alarm rate without a significant increase in the rejection of correctly detected keywords.
A training procedure for verifying string hypotheses in continuous speech recognition
A discriminative training procedure is proposed for verifying the occurrence of string hypotheses produced by a hidden Markov model (HMM) based continuous speech recognizer, to increase the power of a hypothesis test for utterance verification.
GloVe: Global Vectors for Word Representation
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.