Chi-Chun Hsia

This paper presents an expressive voice conversion model (DeBi-HMM) as a post-processing stage of a text-to-speech (TTS) system for expressive speech synthesis. DeBi-HMM is named for the duration-embedded characteristic of its two HMMs, which model the source and target speech signals, respectively. Joint estimation of source and target HMMs is exploited for ...
This paper proposes a method for modeling and generating pitch in hidden Markov model (HMM)-based Mandarin speech synthesis by exploiting prosody hierarchy and dynamic pitch features. The prosodic structure of a sentence is represented by a prosody hierarchy, which is constructed from the predicted prosodic breaks using a supervised classification and ...
Sleeping posture reveals important information for eldercare and patient care, especially for bedridden patients. Previous work has addressed the problem using either pressure sensors or video images. This paper presents a multimodal approach to sleeping posture classification. Features from the pressure sensor map and the video image have been proposed in ...
This study proposes a bed posture detection method using Bayesian classification for the elderly and bedridden. Only 16 long, narrow force sensing resistor (FSR) sensors, rather than a pressure distribution image from a sensor array, are used for classification. Kurtosis and skewness are estimated as the feature vector to represent the shape of pressure ...
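As a rough illustration of the skewness and kurtosis features mentioned in this abstract, the sketch below computes both from a list of 16 readings. It assumes the readings are treated as a plain sample and that population (biased) moments are used; the abstract is truncated, so the exact estimator, any normalization, and the function name `shape_features` are hypothetical, not the authors' method.

```python
def shape_features(readings):
    """Return (skewness, kurtosis) of a 1-D list of sensor readings.

    Assumption: population moments (divide by n), non-excess kurtosis.
    The paper's exact estimator is not given in the truncated abstract.
    """
    n = len(readings)
    mean = sum(readings) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in readings) / n
    if m2 == 0:
        # A flat profile has no shape; return zeros instead of dividing by 0.
        return 0.0, 0.0
    m3 = sum((x - mean) ** 3 for x in readings) / n
    m4 = sum((x - mean) ** 4 for x in readings) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis


# Hypothetical readings from 16 strip sensors (arbitrary units).
example = [0, 0, 1, 3, 7, 12, 15, 14, 10, 6, 3, 1, 0, 0, 0, 0]
features = shape_features(example)
```

The resulting two-element tuple could then serve as the input to a classifier; the choice of classifier inputs here is only an assumption based on the abstract's wording.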
This study presents a fast speaker clustering method based on multidimensional scaling (MDS). Initial acoustic models are trained from speech segments. MDS is used to map the acoustic models into a space whose coordinates best preserve the distances or dissimilarities between models. Speaker clusters are formed using vector quantization on the MDS ...
This paper presents an approach to hierarchical prosody conversion for emotional speech synthesis. The pitch contour of the source speech is decomposed into a hierarchical prosodic structure consisting of sentence, prosodic word, and subsyllable levels. The pitch contour at the higher level is encoded by discrete Legendre polynomial coefficients. The ...
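To make the idea of encoding a pitch contour with discrete Legendre polynomial coefficients concrete, here is a minimal sketch. It assumes the basis is obtained by orthonormalizing the monomials 1, x, x^2, ... over evenly spaced sample points (Gram-Schmidt), and that an order-3 expansion is used; the polynomial order, the normalization, and the function names are hypothetical, since the truncated abstract does not specify them.

```python
import math


def discrete_legendre_basis(n_points, order):
    """Orthonormal polynomial basis on n_points evenly spaced samples in [0, 1].

    Built by Gram-Schmidt on the monomials; requires n_points > order.
    """
    xs = [i / (n_points - 1) for i in range(n_points)]
    basis = []
    for k in range(order + 1):
        v = [x ** k for x in xs]
        # Remove the components along the lower-order basis vectors.
        for b in basis:
            proj = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    return basis


def encode(contour, order=3):
    """Project a sampled pitch contour onto the basis; returns order+1 coefficients."""
    basis = discrete_legendre_basis(len(contour), order)
    return [sum(c * b for c, b in zip(contour, bk)) for bk in basis]


def decode(coeffs, n_points):
    """Rebuild a contour of n_points samples from its coefficients."""
    basis = discrete_legendre_basis(n_points, len(coeffs) - 1)
    return [sum(a * bk[i] for a, bk in zip(coeffs, basis)) for i in range(n_points)]


# Hypothetical contour in Hz: a smooth rise and fall over 10 frames.
contour = [100 + 20 * (i / 9) - 15 * (i / 9) ** 2 for i in range(10)]
coeffs = encode(contour, order=3)
rebuilt = decode(coeffs, len(contour))
```

Because the basis is orthonormal, the coefficients are simple inner products, and a contour that is itself a low-order polynomial is recovered up to rounding error from a handful of numbers.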
This paper presents a variable-length unit selection scheme based on syntactic cost to select text-to-speech (TTS) synthesis units. The syntactic structure of a sentence is derived from a probabilistic context-free grammar (PCFG) and represented as a syntactic vector. The syntactic difference between target and candidate units (words or phrases) is ...
In emotional speech synthesis, a large speech database is required for high-quality speech output. Voice conversion needs only a compact speech database for each emotion. This study designs and collects a set of phonetically balanced, small-sized emotional parallel speech databases to construct conversion functions. The Gaussian mixture bigram model ...
In this study, a conversion function clustering and selection approach to conversion-based expressive speech synthesis is proposed. First, a set of small-sized emotional parallel speech databases is designed and collected to train the conversion functions. A Gaussian mixture bigram model (GMBM) is adopted as the conversion function to model the temporal and ...