Most current state-of-the-art speech recognition systems are based on speech signal parametrizations that crudely model the behavior of the human auditory system. However, little or no use is usually made of knowledge about the human speech production system. A data-driven statistical approach to incorporating this knowledge into ASR would require a …
In this paper, we describe an automatic speech recognition system in which features extracted from the human speech production system, in the form of articulatory movement data, are effectively integrated into the acoustic model for improved recognition performance. The system is based on the hybrid HMM/BN model, which allows for easy integration of different speech …
In this paper, we describe the ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages (Japanese and Chinese). There are three main modules of our S2ST system: large-vocabulary continuous speech recognition, machine text-to-text (T2T) translation, and text-to-speech synthesis. …
When the reference speakers are represented by Gaussian mixture models (GMM), the conventional approach is to accumulate the frame likelihoods over the whole test utterance and compare the results, as in speaker identification, or apply a threshold, as in speaker verification. In this paper we describe a method in which frame likelihoods are transformed into new …
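The conventional accumulate-and-compare scoring described above can be sketched as follows. The diagonal-covariance GMMs, the function names, and the parameter layout are illustrative assumptions, not code from the paper:

```python
import numpy as np

def gmm_frame_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    frames: (T, D); weights: (M,); means, variances: (M, D)."""
    diff = frames[:, None, :] - means[None, :, :]                    # (T, M, D)
    exponent = -0.5 * np.sum(diff**2 / variances, axis=2)            # (T, M)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)  # (M,)
    log_comp = np.log(weights) + log_norm + exponent                 # (T, M)
    m = log_comp.max(axis=1, keepdims=True)                          # stable logsumexp
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def identify(frames, speaker_models):
    """Conventional speaker identification: sum frame log-likelihoods
    over the whole utterance and pick the best-scoring reference speaker."""
    scores = {spk: gmm_frame_loglik(frames, *params).sum()
              for spk, params in speaker_models.items()}
    return max(scores, key=scores.get)
```

For verification, the summed score of a single claimed speaker would instead be compared against a threshold.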
SUMMARY In current HMM-based speech recognition systems, it is difficult to supplement acoustic spectrum features with additional information such as pitch, gender, articulator positions, etc. On the other hand, Bayesian Networks (BN) allow for easy combination of different continuous as well as discrete features by exploring conditional dependencies …
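A minimal sketch of the kind of combination a BN affords: a continuous observation whose Gaussian parameters depend on a discrete auxiliary variable, marginalized over that variable. The function name, the gender example, and the diagonal-Gaussian choice are our assumptions for illustration, not the paper's model:

```python
import numpy as np

def bn_observation_loglik(x, aux_prior, cond_means, cond_vars):
    """Marginal log-likelihood of feature vector x when the observation
    distribution is conditioned on a discrete auxiliary variable a
    (e.g. gender): p(x) = sum_a P(a) N(x; mu_a, diag(var_a)).
    x: (D,); aux_prior: (K,); cond_means, cond_vars: (K, D)."""
    log_gauss = -0.5 * (np.sum(np.log(2 * np.pi * cond_vars), axis=1)
                        + np.sum((x - cond_means) ** 2 / cond_vars, axis=1))
    terms = np.log(aux_prior) + log_gauss   # joint log p(a, x) per value of a
    m = terms.max()                         # stable logsumexp over a
    return m + np.log(np.exp(terms - m).sum())
```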
In this paper, we present a new discriminative training method for Gaussian Mixture Models (GMM) and its application to text-independent speaker recognition. The objective of this method is to maximize the frame-level normalized likelihoods of the training data. That is why we call it Maximum Normalized Likelihood Estimation (MNLE). In contrast to …
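Since the abstract is truncated, the exact MNLE objective is not shown here. A plausible sketch, under the assumption that each frame's target-model likelihood is normalized by the summed likelihoods of all competing speaker models:

```python
import numpy as np

def normalized_likelihood_objective(frame_ll_target, frame_ll_all):
    """Average per-frame normalized likelihood (our reconstruction, not
    necessarily the paper's exact formula).
    frame_ll_target: (T,) log-likelihoods under the target model.
    frame_ll_all: (T, K) log-likelihoods under all K models,
    target included."""
    m = frame_ll_all.max(axis=1, keepdims=True)          # stable logsumexp
    log_denom = (m + np.log(np.exp(frame_ll_all - m)
                            .sum(axis=1, keepdims=True))).ravel()
    # exp(log p_target - log sum_k p_k) = normalized likelihood per frame
    return np.mean(np.exp(frame_ll_target - log_denom))
```

Training would then adjust the target GMM's parameters to push this average toward 1.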
In this paper, we describe a new high-performance on-line speaker diarization system that works faster than real time and has very low latency. It consists of several modules, including voice activity detection, novel speaker detection, and speaker gender and speaker identity classification. All modules share a set of Gaussian mixture models (GMM) representing …
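The module flow above can be sketched as a single per-segment step. Every callable name and signature here is a stand-in assumption about the pipeline, not the system's actual API:

```python
def diarize_segment(segment, vad, is_new_speaker,
                    classify_gender, classify_speaker, enroll_speaker):
    """One illustrative on-line diarization step over a short audio
    segment: detect speech, decide whether the voice is new, then
    label speaker identity and gender."""
    if not vad(segment):
        return None                        # non-speech: emit no label
    if is_new_speaker(segment):
        speaker = enroll_speaker(segment)  # register a new speaker model
    else:
        speaker = classify_speaker(segment)
    return speaker, classify_gender(segment)
```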
It is difficult to recognize speech distorted by various factors, especially when an ASR system contains only a single acoustic model. One solution is to use multiple acoustic models, one for each condition. In this paper, we discuss a parallel decoding-based ASR system that is robust to noise type, SNR, speaker gender, and speaking …
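The parallel-decoding idea can be sketched as running one decoder per condition-specific acoustic model and keeping the best-scoring hypothesis. The decoder interface (a callable returning a hypothesis and a score) is an assumption for illustration:

```python
def parallel_decode(features, recognizers):
    """recognizers maps a condition name (noise type, SNR, gender, ...)
    to a decode function returning (hypothesis, score); the hypothesis
    from the best-matching acoustic model wins."""
    best_hyp, best_score = None, float("-inf")
    for condition, decode in recognizers.items():
        hyp, score = decode(features)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```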
Most current state-of-the-art speech recognition systems use the Hidden Markov Model (HMM) to model the acoustic characteristics of a speech signal. In the first-order HMM, speech data are assumed to be independently and identically distributed (i.i.d.), meaning that there is no dependency between neighboring feature vectors. Another assumption is …
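The independence assumption shows up directly in the forward recursion: each frame contributes only `obs_ll[t, j]`, a term that depends on the current state alone and never on neighboring frames. A minimal log-domain sketch (variable names are ours):

```python
import numpy as np

def hmm_log_likelihood(obs_ll, log_pi, log_A):
    """Forward algorithm in the log domain for an N-state first-order HMM.
    obs_ll[t, j]: log-likelihood of frame t under state j (the output-
    independence assumption discussed above); log_pi: (N,) initial state
    log-probabilities; log_A: (N, N) transition log-probabilities."""
    T, N = obs_ll.shape
    alpha = log_pi + obs_ll[0]
    for t in range(1, T):
        m = alpha.max()                                        # stable logsumexp
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + obs_ll[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```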
Current automatic speech recognition systems have two distinct modes of operation: training and recognition. After training, the system parameters are fixed, and if a mismatch between training and testing conditions occurs, an adaptation procedure is commonly applied. However, adaptation methods change the system parameters in such a way that …