We show that the standard hypothesis scoring paradigm used in maximum-likelihood-based speech recognition systems is not optimal with regard to minimizing the word error rate, the commonly used performance metric in speech recognition. This can lead to sub-optimal performance, especially in high-error-rate environments where word error and sentence error(More)
SRI International is currently involved in the development of a new generation of software systems for automatic scoring of pronunciation as part of the Voice Interactive Language Training System (VILTS) project. This paper describes the goals of the VILTS system, the speech corpus, and the algorithm development. The automatic grading system uses SRI's(More)
We present a paradigm for the automatic assessment of pronunciation quality by machine. In this scoring paradigm, both native and nonnative speech data is collected, and a database of human-expert ratings is created to enable the development of a variety of machine scores. We first discuss issues related to the design of speech databases, and the reliability(More)
This paper addresses the issue of closed-set text-independent speaker identification from samples of speech recorded over the telephone. It focuses on the effects of acoustic mismatches between training and testing data, and concentrates on two approaches: 1) extracting features that are robust against channel variations and 2) transforming the speaker(More)
This paper proposes a probabilistic framework to define and evaluate confidence measures for word recognition. We describe a novel method to combine different knowledge sources and estimate the confidence in a word hypothesis, via a neural network. We also propose a measure of the joint performance of the recognition and confidence systems. The definitions and(More)
This paper studies the effects of handset distortion on telephone-based speaker recognition performance, resulting in the following observations: (1) the major factor in speaker recognition errors is whether the handset type (e.g., electret, carbon) is different across training and testing, not whether the telephone lines are mismatched, (2) the distribution of(More)
A method is described for designing speaker recognition features that are robust to telephone handset distortion. The approach transforms features such as mel-cepstral features, log spectrum, and prosody-based features with a non-linear artificial neural network. The neural network is discriminatively trained to maximize speaker recognition performance(More)
Statistics of frame-level pitch have recently been used in speaker recognition systems with good results [1, 2, 3]. Although they convey useful long-term information about a speaker's distribution of f0 values, such statistics fail to capture information about local dynamics in intonation that characterize an individual's speaking style. In this work, we(More)