Yishuang Ning

Learn More
Speech is bimodal in nature. There are close correlations between the acoustic speech signals and the visual gestures such as lip movements, facial expressions and head motions. For speech driven talking avatar, how to derive more representative acoustic features from which to predict more accurate and realistic visual gestures still remains the research(More)
Emphasis plays an important role in expressive speech synthesis in highlighting the focus of an utterance to draw the attention of the listener. As there are only a few emphasized words in a sentence, the problem of the data limitation is one of the most important problems for emphatic speech synthesis. In this paper, we analyze contrastive (neutral versus(More)
This paper proposes a new framework for emphasis detection from natural speech, where emphasis refers to a word or part of a word perceived as standing out from its surrounding words. Labeling emphatic words from speech recordings plays a significant role not only in human-computer interactions, but also in building speech corpus for expressive speech(More)
Labeling emphatic words from speech recordings plays an important role in building speech corpus for expressive speech synthesis. People generally pronounce some words stronger than usual, making the speech more expressive and signaling the focus of the sentence. Contrastive word pairs are often pronounced with stronger prominences and their presence(More)
Speech are widely used to express one's emotion, intention, desire, etc. in social network communication, deriving abundant of internet speech data with different speaking styles. Such data provides a good resource for social multimedia research. However, regarding different styles are mixed together in the internet speech data, how to classify such data(More)
From speech, speaker identity can be mostly characterized by the spectro-temporal structures of spectrum. Although recent researches have demonstrated the effectiveness of employing long short-term memory (LSTM) recurrent neural network (RNN) in voice conversion, traditional LSTM-RNN based approaches usually focus on temporal evolutions of speech features(More)
This paper investigates the incorporation of hidden Markov model (HMM) based emphatic speech synthesis for audio exaggeration into an audio-visual speech synthesis framework for the corrective feedback in computer-aided pronunciation training (CAPT). To improve the voice quality of the synthetic emphatic speech, this paper proposes a new method for HMM(More)