Kartik Audhkhasi
Detection of filled pauses is a challenging research problem that has several practical applications. It can be used to evaluate a speaker's spoken fluency skills, to improve the performance of automatic speech recognition systems, or to predict the speaker's mental state. This paper presents an algorithm for filled pause detection that is based …
A popular approach for keyword search in speech files is the phone lattice search. Recently, minimum edit distance (MED) has been used as a measure of string similarity, rather than simple string matching, when searching the phone lattice for the keyword. In this paper, we propose a variation of the MED in which the substitution penalties are …
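The abstract does not specify the modified substitution penalties, but the underlying dynamic-programming recurrence is standard. A minimal sketch of a weighted minimum edit distance, with the substitution penalty exposed as a pluggable cost function (the confusability-based costs here are purely illustrative):

```python
def min_edit_distance(ref, hyp,
                      sub_cost=lambda a, b: 0.0 if a == b else 1.0,
                      ins_cost=1.0, del_cost=1.0):
    """Weighted minimum edit distance via dynamic programming.

    sub_cost(a, b) returns the penalty for substituting symbol a with b;
    passing a custom function lets confusable phones be penalized less.
    """
    m, n = len(ref), len(hyp)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + del_cost,            # deletion
                          d[i][j - 1] + ins_cost,            # insertion
                          d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]))
    return d[m][n]


# Classic example: kitten -> sitting needs 2 substitutions + 1 insertion.
print(min_edit_distance("kitten", "sitting"))  # 3.0

# Hypothetical phone-confusability cost: 'b'/'p' substitutions cost only 0.5.
confusable = lambda a, b: 0.0 if a == b else (0.5 if {a, b} == {"b", "p"} else 1.0)
print(min_edit_distance(["b", "a", "t"], ["p", "a", "t"], sub_cost=confusable))  # 0.5
```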
Word error rates on the Switchboard conversational corpus, which just a few years ago stood at 14%, have dropped to 8.0%, then 6.6%, and most recently 5.8%, and are now believed to be within striking range of human performance. This raises two issues: what is human performance, and how far down can we still drive speech recognition error rates? In trying to …
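For reference, the word error rate (WER) figures quoted above are defined as the Levenshtein distance between reference and hypothesis word sequences, normalized by reference length. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1] / len(r)


# One inserted word against a three-word reference -> WER of 1/3.
print(word_error_rate("the cat sat", "the cat sat on"))
```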
We present an analysis of several publicly available automatic speech recognizers (ASRs) in terms of their suitability for use in different types of dialogue systems. We focus in particular on cloud-based ASRs that have recently become available to the community. We include features of ASR systems and desiderata and requirements for different dialogue …
Non-verbal speech cues serve multiple functions in human interaction, such as maintaining the conversational flow as well as expressing emotions, personality, and interpersonal attitude. In particular, non-verbal vocalizations such as laughter are associated with affective expressions, while vocal fillers are used to hold the floor during a conversation. The …
Professional manual transcription of speech is an expensive and time-consuming process. This paper focuses on the problem of combining noisy transcriptions from multiple non-expert transcribers, where the quality of work from each worker varies. Computing transcriber reliability is a difficult task in the absence of gold standard reference transcripts. Three …
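The snippet does not name the three reliability methods, but the simplest baseline for combining multiple transcriptions is a word-level majority vote. A minimal sketch, under the simplifying assumption that the transcripts have already been aligned to equal length (real systems must first align them, e.g. with edit-distance alignment):

```python
from collections import Counter

def majority_merge(transcripts):
    """Combine pre-aligned transcripts by per-position majority vote.

    transcripts: list of word lists, all of the same length.
    Returns the merged word list; ties go to the first-seen word.
    """
    merged = []
    for words in zip(*transcripts):
        merged.append(Counter(words).most_common(1)[0][0])
    return merged


# Three noisy transcribers, each wrong in one position:
workers = [["the", "cat", "sat"],
           ["the", "bat", "sat"],
           ["the", "cat", "sit"]]
print(majority_merge(workers))  # ['the', 'cat', 'sat']
```

Reliability-weighted schemes generalize this by giving each transcriber's vote a weight, which must itself be estimated without reference transcripts.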
We prove that noise can speed convergence in the backpropagation algorithm. The proof consists of two separate results. The first result proves that the backpropagation algorithm is a special case of the generalized Expectation-Maximization (EM) algorithm for iterative maximum likelihood estimation. The second result uses the recent EM noise benefit to …
We address the challenge of interpreting spoken input in a conversational dialogue system with an approach that aims to exploit the close relationship between the tasks of speech recognition and language understanding through joint modeling of these two tasks. Instead of using a standard pipeline approach where the output of a speech recognizer is the input …
Diversity or complementarity of automatic speech recognition (ASR) systems is crucial for achieving a reduction in word error rate (WER) upon fusion using the ROVER algorithm. We present a theoretical proof explaining this often-observed link between ASR system diversity and ROVER performance. This is in contrast to many previous works that have only …
In emotion recognition, a widely used method to reconcile disagreement between multiple human evaluators is to perform majority voting on their assigned class labels. Instead, we propose asking evaluators to rank emotional categories given an audio clip, followed by a combination of these ranked lists. We compare two well-known ranked-list voting methods …
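The snippet does not name the two voting methods it compares, but the Borda count is one standard way to combine ranked lists and illustrates the idea. A minimal sketch: each evaluator ranks the categories from most to least likely, and a category at rank r in a list of n items earns n - 1 - r points:

```python
def borda_count(rankings):
    """Combine ranked lists via the Borda count.

    rankings: list of lists, each ordering the same categories from
    most to least preferred. Returns categories sorted by total score.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for r, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + (n - 1 - r)
    return sorted(scores, key=lambda label: -scores[label])


# Hypothetical rankings from three evaluators of one audio clip:
evaluators = [["happy", "neutral", "sad"],
              ["neutral", "happy", "sad"],
              ["happy", "sad", "neutral"]]
print(borda_count(evaluators))  # ['happy', 'neutral', 'sad']
```

Unlike majority voting on single labels, this uses each evaluator's full preference order, so second choices still contribute when first choices disagree.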