We investigated a robust speech feature extraction method using kernel PCA (principal component analysis) for distorted speech recognition. Kernel PCA has been suggested for various image processing tasks requiring an image model, such as denoising, where a noise-free image is constructed from a noisy input image [1]. Much research on robust speech …
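The denoising idea mentioned above can be sketched in a few lines: project noisy feature vectors onto the leading kernel principal components and map back to the input space (the approximate pre-image), discarding directions dominated by noise. This is a minimal illustration using scikit-learn's KernelPCA on synthetic data, not the paper's actual pipeline; all parameter choices here are assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Synthetic stand-in data: a smooth signal replicated across 12 feature
# dimensions, corrupted by Gaussian noise.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 4 * np.pi, 200))[:, None] * np.ones((1, 12))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

# Kernel PCA with an RBF kernel; fit_inverse_transform=True enables
# pre-image estimation (mapping kernel-space codes back to input space).
kpca = KernelPCA(n_components=4, kernel="rbf", gamma=0.1,
                 fit_inverse_transform=True)
codes = kpca.fit_transform(noisy)         # project onto kernel PCs
denoised = kpca.inverse_transform(codes)  # approximate pre-image

print(denoised.shape)  # same shape as the noisy input
```

Keeping only a few kernel components acts as the "model" of the clean data; the reconstruction is the denoised estimate.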
This paper proposes a novel feature extraction method for speech recognition based on gradient features on a 2-D time-frequency matrix. Widely used MFCC features lack temporal dynamics. In addition, ΔMFCC is an indirect expression of temporal frequency changes. To extract the temporal dynamics more directly, we propose local gradient features in an …
This paper presents a voice conversion (VC) technique for noisy environments, where parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal. The parallel exemplars (dictionary) consist of the source exemplars and target exemplars, having the same texts uttered by the source and target speakers. The input …
SUMMARY This paper describes a hands-free speech recognition technique based on acoustic model adaptation to reverberant speech. In hands-free speech recognition, the recognition accuracy is degraded by reverberation, since each segment of speech is affected by the reflection energy of the preceding segment. To compensate for the reflection signal, we …
In this paper, an audio-visual speech corpus CENSREC-1-AV for noisy speech recognition is introduced. CENSREC-1-AV consists of an audio-visual database and a baseline system of bimodal speech recognition which uses audio and visual information. In the database, there are 3,234 and 1,963 utterances made by 42 and 51 speakers as the training and test sets, respectively …
We investigated the speech recognition of a person with articulation disorders resulting from athetoid cerebral palsy. The articulation of the first words spoken tends to be unstable due to the strain placed on the speech-related muscles, and this causes degradation of speech recognition. Therefore, we proposed a robust feature extraction method based on …
Random projection has been suggested as a means of dimensionality reduction, where the original data are projected onto a subspace using a random matrix. It represents a computationally simple method that approximately preserves the Euclidean distance of any two points through the projection. Moreover, as we are able to produce various random matrices, …
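The distance-preservation property described above (the Johnson-Lindenstrauss lemma) is easy to demonstrate: multiply the data by a Gaussian random matrix scaled by 1/sqrt(k) and compare pairwise distances before and after. A minimal sketch, with all dimensions chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
d, k, n = 1000, 200, 50          # original dim, projected dim, num points
X = rng.standard_normal((n, d))

# Gaussian random projection matrix; the 1/sqrt(k) scaling makes the
# projected squared distances unbiased estimates of the originals.
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R

# Pairwise distance between two points, before and after projection.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig
print(ratio)  # close to 1 with high probability
```

With k = 200 the relative distortion is small with overwhelming probability; shrinking k trades accuracy for speed, and drawing several independent matrices R yields multiple projected views of the same data.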
This paper presents an emotional voice conversion (VC) technology using non-negative matrix factorization, where parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal. The input source spectrum is decomposed into the source spectrum exemplars and their weights. By replacing source exemplars with target …
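The exemplar-replacement step described above can be sketched with plain NMF: estimate non-negative activation weights W so that the source spectrum X ≈ A_src @ W with the source dictionary A_src held fixed, then synthesize the target as A_tgt @ W using the parallel target dictionary. This is a hypothetical minimal sketch on random data; the variable names, the Euclidean multiplicative update, and the dimensions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_bins, n_ex, n_frames = 64, 20, 30

# Parallel dictionaries: source and target exemplars for the same texts.
A_src = np.abs(rng.random((n_bins, n_ex)))
A_tgt = np.abs(rng.random((n_bins, n_ex)))
X = np.abs(rng.random((n_bins, n_frames)))  # input source magnitude spectrum

# Estimate activations W with multiplicative updates (Euclidean NMF),
# keeping the dictionary A_src fixed so W stays non-negative.
W = np.abs(rng.random((n_ex, n_frames)))
for _ in range(200):
    W *= (A_src.T @ X) / (A_src.T @ A_src @ W + 1e-12)

# Conversion: apply the same activations to the target exemplars.
Y = A_tgt @ W
print(Y.shape)  # (64, 30)
```

Because the two dictionaries are frame-aligned over the same utterances, the activations learned on the source side index "which exemplars are active when", and reusing them on the target side transfers that timing to the target speaker's spectra.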
This paper presents a voice conversion technique using Deep Belief Nets (DBNs) to build high-order eigen spaces of the source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space. DBNs have a deep architecture that automatically discovers abstractions to maximally express the original …