Every speech recognition system requires a signal representation that parametrically models the temporal evolution of the speech spectral envelope. Current parameterizations involve, either explicitly or implicitly, a set of energies from frequency bands that are often distributed on a mel scale. The computation of those energies is performed in diverse …
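As an illustration of the band-energy computation this abstract refers to, here is a minimal sketch of mel-spaced triangular filterbank energies in Python. The filter count, sample rate, and triangular filter shape are common defaults assumed for the example, not details taken from the paper:

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale mapping (O'Shaughnessy formula).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank_energies(power_spectrum, sample_rate, n_filters=20):
    """Energies of triangular filters equally spaced on the mel scale.

    `power_spectrum` is the one-sided power spectrum of one frame
    (length n_fft // 2 + 1).  Illustrative sketch only, not the exact
    parameterization used in any particular recognizer.
    """
    n_bins = len(power_spectrum)
    n_fft = 2 * (n_bins - 1)
    # Filter edge frequencies, equally spaced in mel.
    mel_edges = np.linspace(hz_to_mel(0.0),
                            hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges)
                         / sample_rate).astype(int)

    energies = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bin_edges[i], bin_edges[i + 1], bin_edges[i + 2]
        for k in range(lo, hi):
            # Triangular weight rising from lo to mid, falling to hi.
            if k < mid:
                w = (k - lo) / max(mid - lo, 1)
            else:
                w = (hi - k) / max(hi - mid, 1)
            energies[i] += w * power_spectrum[k]
    return energies
```

Each filter integrates the power spectrum over one mel-spaced band, so the output is a compact, perceptually motivated description of the spectral envelope for that frame.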
In this paper, we present the results of the Acoustic Event Detection (AED) and Classification (AEC) evaluations carried out in February 2006 by the three participating partners of the CHIL project. The primary evaluation task was AED on the testing portions of the isolated-sound databases and seminar recordings produced in CHIL. Additionally, a secondary …
Acoustic event detection (AED) aims at determining the identity of sounds and their temporal position in the signals that are captured by one or several microphones. The AED problem has been recently proposed for meeting-room or classroom environments, where a specific set of meaningful sounds has been defined, and several evaluations have been carried out …
Acoustic events produced in controlled environments may carry information useful for perceptually aware interfaces. In this paper we focus on the problem of classifying 16 types of meeting-room acoustic events. First, we defined the events and gathered a sound database. Then, several classifiers based on support vector machines (SVM) are …
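An SVM-based event classifier of the kind the abstract describes can be sketched with scikit-learn. The synthetic features, the three toy classes, and the RBF kernel settings below are assumptions for illustration; the paper's actual 16 meeting-room classes and features are not reproduced here:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical fixed-length feature vectors (e.g. frame statistics)
# for three well-separated toy event classes.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 13))
               for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)

# Multi-class SVM with an RBF kernel; scikit-learn handles the
# one-vs-one decomposition internally.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
```

In a real system each acoustic event would first be mapped to a fixed-length feature vector (e.g. statistics of spectral features over the event) before being fed to the SVM.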
The aim of this correspondence is to present a robust representation of speech based on AR modeling of the causal part of the autocorrelation sequence. Its performance in noisy speech recognition is compared with that of several related techniques, showing that it achieves better results under severe noise conditions.
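The AR-modeling step can be illustrated with the Levinson-Durbin recursion, which solves the Yule-Walker equations for the AR coefficients given an autocorrelation sequence. This is a generic sketch of that recursion; the paper's contribution is to apply such modeling to the causal part of the autocorrelation sequence rather than to the signal directly:

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: AR coefficients a[0..order]
    (with a[0] = 1) and final prediction error, from the
    autocorrelation sequence r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = float(r[0])
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / e
        # Order-update of the coefficient vector.
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= (1.0 - k * k)
    return a, e
```

For example, the autocorrelation r = [1, 0.5, 0.25] of a first-order AR process with coefficient 0.5 yields a ≈ [1, -0.5, 0], recovering the underlying model.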
Cepstral coefficients are widely used in speech recognition. In this paper, we claim that they are not the best way of representing the spectral envelope, at least for some common speech recognition systems. In fact, the cepstrum has several disadvantages: poor physical meaning, the need for an extra transformation, and a limited capacity to adapt to some recognition systems. …
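The "extra transformation" the abstract mentions is typically a DCT applied to log band energies to obtain cepstral coefficients. A minimal sketch of that step (the DCT-II form and coefficient count are generic assumptions, not taken from the paper):

```python
import numpy as np

def cepstrum_from_log_energies(log_energies, n_ceps=12):
    """Cepstral coefficients as the DCT-II of log band energies.

    This is the additional transformation step that separates the
    band-energy representation from the cepstral one.  Illustrative
    sketch only.
    """
    n = len(log_energies)
    k = np.arange(n)
    return np.array([
        np.sum(log_energies * np.cos(np.pi * q * (k + 0.5) / n))
        for q in range(1, n_ceps + 1)  # q = 0 (overall level) is skipped
    ])
```

Note that a flat log-energy vector maps to all-zero cepstral coefficients, since the constant component lives entirely in the skipped q = 0 term.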
When performing speaker diarization, it is common practice to use an agglomerative clustering approach in which the acoustic data is first split into small segments and pairs of these segments are then merged until a particular stopping point is reached. The diarization performance can be greatly improved by the use of a speech/non-speech detector. The use of a …
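The merge decision in agglomerative diarization is often made with a ΔBIC criterion. Below is a simplified sketch of that criterion, comparing one pooled Gaussian against two separate full-covariance Gaussians; this is a generic formulation, not the specific system evaluated in the paper:

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Simplified BIC-based merge criterion between two segments.

    Compares modeling x and y with two full-covariance Gaussians
    versus one pooled Gaussian.  Negative values favor merging the
    two segments; positive values favor keeping them separate.
    """
    z = np.vstack([x, y])
    d = z.shape[1]

    def half_n_logdet(s):
        # (n/2) * log|cov|, the log-likelihood term for one Gaussian.
        cov = np.cov(s, rowvar=False) + 1e-6 * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        return 0.5 * len(s) * logdet

    # Penalty for the extra mean and covariance parameters of the
    # two-model hypothesis, scaled by the tunable weight lam.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(len(z))
    return half_n_logdet(z) - half_n_logdet(x) - half_n_logdet(y) - penalty
```

A greedy agglomerative loop would repeatedly merge the segment pair with the most negative ΔBIC and stop once no pair scores below zero, which is one common choice of the stopping point the abstract mentions.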