We propose a rescoring framework for speech recognition that incorporates acoustic-phonetic knowledge sources. The scores corresponding to all knowledge sources are generated by a collection of neural network based classifiers. Rescoring is then performed by combining the different knowledge scores and using them to reorder candidate strings provided by …
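A minimal sketch of the kind of rescoring this abstract describes: combine per-hypothesis scores from several knowledge sources with tunable weights and reorder an N-best list. The score names and weights below are illustrative assumptions, not the paper's actual knowledge sources.

```python
# Hypothetical N-best rescoring by weighted combination of knowledge scores.

def rescore_nbest(hypotheses, weights):
    """Reorder candidate strings by a weighted sum of per-hypothesis scores.

    hypotheses: list of dicts like
        {"text": str, "scores": {"baseline": float, "phonetic": float, ...}}
    weights: dict mapping score name -> combination weight
    """
    def combined(hyp):
        return sum(weights[name] * value
                   for name, value in hyp["scores"].items())
    return sorted(hypotheses, key=combined, reverse=True)

nbest = [
    {"text": "recognize speech", "scores": {"baseline": -12.3, "phonetic": -4.1}},
    {"text": "wreck a nice beach", "scores": {"baseline": -11.9, "phonetic": -7.8}},
]
reordered = rescore_nbest(nbest, weights={"baseline": 1.0, "phonetic": 0.5})
print(reordered[0]["text"])  # best hypothesis after rescoring
```

In practice the combination weights would be tuned on held-out data rather than fixed by hand as here.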
Acoustic event detection is an important step for audio content analysis and retrieval. Traditional detection techniques model acoustic events on frame-based spectral features. Considering that the temporal-frequency structures of acoustic events may span time scales beyond single frames, we propose to represent those structures as a bag of spectral …
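A hedged sketch of the bag-of-spectral-patches idea: slice a spectrogram into multi-frame time-frequency patches, quantize them against a learned codebook, and summarize a clip as a codeword histogram. Patch size, hop, and codebook size are illustrative choices, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(spec, patch_frames=8, hop=4):
    """Slice a (freq_bins x frames) spectrogram into flattened
    time-frequency patches spanning several frames."""
    patches = [spec[:, t:t + patch_frames].ravel()
               for t in range(0, spec.shape[1] - patch_frames + 1, hop)]
    return np.array(patches)

def bag_of_patches(spec, codebook):
    """Quantize each patch against the codebook and return a
    normalized histogram of codeword counts."""
    words = codebook.predict(extract_patches(spec))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

# Learn the codebook from training spectrograms (random stand-in data here).
train_spec = np.abs(np.random.randn(64, 200))
codebook = KMeans(n_clusters=32, n_init=10).fit(extract_patches(train_spec))
print(bag_of_patches(train_spec, codebook))
```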
We have previously applied a deep autoencoder (DAE) to noise reduction and speech enhancement. However, the DAE was trained using only clean speech. In this study, we further introduce an explicit denoising process into learning the DAE. In training the DAE, we still adopt a greedy layer-wise pretraining plus fine-tuning strategy. In pretraining, each layer is …
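The key contrast in this abstract is training on noisy-input/clean-target pairs rather than on clean speech alone. A minimal sketch of that explicit denoising objective, with layer sizes, noise level, and optimizer settings as illustrative assumptions:

```python
import torch
import torch.nn as nn

# Single-layer denoising autoencoder: maps noisy log-spectra to clean targets.
dae = nn.Sequential(
    nn.Linear(257, 512), nn.Sigmoid(),   # encoder
    nn.Linear(512, 257),                 # decoder
)
optimizer = torch.optim.Adam(dae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

clean = torch.randn(128, 257)                  # stand-in for clean log-spectra
noisy = clean + 0.3 * torch.randn_like(clean)  # explicit corruption

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(dae(noisy), clean)    # noisy input, clean reconstruction target
    loss.backward()
    optimizer.step()
```

Training on the (noisy, clean) pair forces the network to learn the denoising mapping itself, which a clean-only autoencoder never sees.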
We present methods of detector design in the Automatic Speech Attribute Transcription project. This paper details the results of a student-led, cross-site collaboration among the Georgia Institute of Technology, The Ohio State University, and Rutgers University. The work reported here describes and evaluates the detection-based ASR paradigm and …
The denoising autoencoder (DAE) is effective in restoring clean speech from noisy observations. In addition, it can easily be stacked into a deep denoising autoencoder (DDAE) architecture to further improve performance. In most studies, it is assumed that the DAE or DDAE can learn arbitrarily complex transform functions to approximate the transformation between …
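To make the stacking step concrete, here is a hedged sketch of greedy layer-wise construction of a DDAE: each stage trains a shallow DAE on the previous stage's hidden codes, and the trained encoders are then stacked. Layer sizes and training details are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_dae(noisy, clean, hidden_dim, steps=100):
    """Train one shallow DAE stage; return the encoder plus the
    hidden codes of the noisy and clean inputs for the next stage."""
    dim = noisy.shape[1]
    enc = nn.Sequential(nn.Linear(dim, hidden_dim), nn.Sigmoid())
    dec = nn.Linear(hidden_dim, dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(noisy)), clean)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return enc, enc(noisy), enc(clean)

clean = torch.randn(128, 257)
noisy = clean + 0.3 * torch.randn_like(clean)
enc1, h_noisy, h_clean = train_dae(noisy, clean, 512)   # first stage on spectra
enc2, _, _ = train_dae(h_noisy, h_clean, 256)           # second stage on hidden codes
ddae_encoder = nn.Sequential(enc1, enc2)                 # stacked encoder, ready for fine-tuning
```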
In order to incorporate long-span temporal-frequency structure into acoustic event detection, we have proposed a spectral patch based learning and representation method. The learned spectral patches were regarded as acoustic words, which were further used in sparse encoding for acoustic feature representation and modeling. In our previous study, during feature …
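A hedged sketch of the sparse-encoding step over a dictionary of "acoustic words": each spectral patch is represented by a few active dictionary atoms, and the activations are pooled into a clip-level feature. The dictionary here is random for illustration; in the paper's setting it would be learned from training patches.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

n_words, patch_dim = 64, 512          # illustrative sizes
dictionary = np.random.randn(n_words, patch_dim)
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

coder = SparseCoder(dictionary=dictionary,
                    transform_algorithm="omp",
                    transform_n_nonzero_coefs=5)   # at most 5 active acoustic words
patches = np.random.randn(10, patch_dim)           # stand-in spectral patches
codes = coder.transform(patches)                   # sparse activations per patch
feature = np.abs(codes).max(axis=0)                # max-pooled clip-level feature
print(feature.shape)                               # (64,)
```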
We propose an ensemble speaker and speaking environment modeling (ESSEM) approach to characterizing environments in order to enhance the performance robustness of automatic speech recognition systems under adverse conditions. The ESSEM process comprises two phases: offline and online. In the offline phase, we prepare an ensemble speaker and speaking …
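A simplified sketch of the online half of this ensemble idea: characterize a new environment as a weighted combination of pre-trained environment models. Using least squares on mean supervectors is an illustrative simplification, not the paper's actual estimation procedure.

```python
import numpy as np

ensemble = np.random.randn(8, 1024)      # 8 offline environment supervectors (stand-ins)
test_stats = np.random.randn(1024)       # statistics gathered from the test utterance

# Solve for combination weights w minimizing ||ensemble.T @ w - test_stats||.
w, *_ = np.linalg.lstsq(ensemble.T, test_stats, rcond=None)
adapted = ensemble.T @ w                 # combined environment model for decoding
print(w.round(2))
```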
Deep neural networks (DNNs) are becoming widely accepted in automatic speech recognition (ASR) systems. Their deep, nonlinear processing greatly improves the model's generalization capability, but performance under adverse environments remains unsatisfactory. In the literature, many techniques have been successfully developed to improve …
We propose an acoustic segment model (ASM) approach to incorporating temporal information into speaker modeling in text-independent speaker recognition. In training, the proposed framework first estimates a collection of ASM-based universal background models (UBMs). Multiple sets of speaker-specific ASMs are then obtained by adapting the ASM-based UBMs with …
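A hedged sketch of the adaptation step this abstract ends on: deriving speaker-specific models from a background model via MAP adaptation of the mixture means. A single GMM stands in for the collection of ASM-based UBMs, and the relevance factor is an illustrative choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

ubm = GaussianMixture(n_components=16, covariance_type="diag")
ubm.fit(np.random.randn(2000, 20))       # stand-in background data

def map_adapt_means(ubm, speaker_feats, relevance=16.0):
    """Classic MAP mean adaptation: interpolate per-mixture ML means
    with the UBM means, weighted by the soft occupancy counts."""
    post = ubm.predict_proba(speaker_feats)          # (T, M) responsibilities
    n_k = post.sum(axis=0)                           # soft counts per mixture
    f_k = post.T @ speaker_feats                     # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]       # adaptation coefficients
    ml_means = f_k / np.maximum(n_k, 1e-8)[:, None]
    return alpha * ml_means + (1.0 - alpha) * ubm.means_

speaker_means = map_adapt_means(ubm, np.random.randn(300, 20))
print(speaker_means.shape)               # (16, 20)
```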
The Gaussian mixture model (GMM)-based method has dominated the field of voice conversion (VC) for the last decade. However, the converted spectra are excessively smoothed and thus produce muffled converted sounds. In this study, we improve speech quality by enhancing the dependency between the source (natural sound) and converted feature vectors (converted …
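For context on the GMM baseline being improved here, a minimal sketch of the classic conversion function: the converted vector is a posterior-weighted sum of per-mixture linear regressions of the source vector. The joint-GMM parameters are random stand-ins and the diagonal cross-covariances are an illustrative simplification; the averaging across mixtures is exactly what produces the over-smoothing the abstract mentions.

```python
import numpy as np

M, D = 4, 24                                   # mixtures, feature dimension
w = np.full(M, 1.0 / M)                        # mixture weights
mu_x, mu_y = np.random.randn(M, D), np.random.randn(M, D)
var_x = np.ones((M, D))                        # diagonal source covariances
cov_xy = 0.5 * np.ones((M, D))                 # diagonal cross-covariances

def convert(x):
    # Posterior p(m | x) under the source-side marginal GMM.
    log_p = (np.log(w)
             - 0.5 * np.sum((x - mu_x) ** 2 / var_x + np.log(var_x), axis=1))
    post = np.exp(log_p - log_p.max())
    post /= post.sum()
    # Per-mixture MMSE regression, averaged under the posteriors.
    y_m = mu_y + cov_xy / var_x * (x - mu_x)   # (M, D)
    return post @ y_m

print(convert(np.random.randn(D)).shape)       # (D,)
```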