Yusuke Ijima

This paper describes an approach to HMM-based expressive speech synthesis which does not require any supervised labeling process for emphasis context. We use appealing-style speech whose sentences were taken from real domains. To reduce the cost of labeling speech data with an emphasis context for model training, we propose an unsupervised labeling …
We propose a postfilter based on a generative adversarial network (GAN) to compensate for the differences between natural speech and speech synthesized by statistical parametric speech synthesis. In particular, we focus on the differences caused by over-smoothing, which makes the sounds muffled. Over-smoothing occurs in the time and frequency directions and …
This paper proposes a technique for emotional speech recognition which enables us to extract paralinguistic information as well as linguistic information contained in the speech signal. The technique is based on style estimation and style adaptation using a multiple-regression HMM. The recognition process consists of two stages. In the first stage, a style vector …
Recent studies have shown that DNN-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, an open problem remains as to whether the synthesized speech quality can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech …
This paper proposes an unsupervised labeling technique using phrase-level prosodic contexts for HMM-based expressive speech synthesis, which enables users to manually enhance prosodic variations of synthetic speech without degrading naturalness. In the proposed technique, HMMs are first trained using the conventional labels including only linguistic …
This paper describes the correlations between various acoustic features and perceptual voice quality similarity. We focus on identifying the acoustic features that are correlated with voice quality similarity. First, a large-scale perceptual experiment using the voices of 62 speakers is conducted and perceptual similarity scores between each pair of …
This paper proposes a technique for adding more prosodic variations to synthetic speech in HMM-based expressive speech synthesis. We create novel phrase-level F0 context labels from the residual information of F0 features between original and synthetic speech for the training data. Specifically, we classify the difference of average log F0 values …
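The labeling step in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the three-way label set, and the threshold value are assumptions for the sketch; the abstract only states that the difference of average log F0 values between original and synthetic speech is classified.

```python
import numpy as np

def f0_context_label(original_logf0, synthetic_logf0, threshold=0.1):
    """Assign a phrase-level F0 context label from the residual between
    the average log F0 of original and synthetic speech for a phrase.

    The threshold (in log Hz) and the label set are illustrative
    choices, not values from the paper.
    """
    residual = np.mean(original_logf0) - np.mean(synthetic_logf0)
    if residual > threshold:
        return "high"   # original phrase noticeably higher than synthetic
    if residual < -threshold:
        return "low"    # original phrase noticeably lower than synthetic
    return "neutral"
```

Such labels could then be appended to the linguistic context of each phrase before retraining the HMMs.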
This paper describes a model adaptation technique for emotional speech recognition based on a multiple-regression HMM (MR-HMM). We use a low-dimensional vector called the style vector, which corresponds to the degree of expressivity of emotional speech, as the explanatory variable of the regression. In the proposed technique, first, the value of the style vector for …
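In an MR-HMM, each Gaussian mean is expressed as a regression on the style vector, commonly written mu = H @ xi with xi = [1, s^T]^T stacking a bias with the style vector s. A least-squares sketch of recovering s from an observed mean is shown below; the actual estimation in the paper works on the full HMM likelihood, so this is only an illustrative formulation.

```python
import numpy as np

def estimate_style_vector(H, observed_mean):
    """Recover the style vector s from a mean modeled as mu = H @ xi,
    where xi = [1, s^T]^T. Illustrative least-squares sketch; the
    paper's estimation is likelihood-based over the whole model.
    """
    xi, *_ = np.linalg.lstsq(H, observed_mean, rcond=None)
    return xi[1:]  # drop the bias component, keep the style vector
```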
This paper presents a novel objective evaluation technique for statistical parametric speech synthesis. One of its novel features is that it focuses on the association between dimensions within the spectral features. We first use a maximal information coefficient to analyze the relationship between subjective scores and associations of spectral features …
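The pairwise-association idea above can be sketched as follows. Note the substitution: the paper uses the maximal information coefficient (MIC), which captures nonlinear dependence, while this stand-in uses the absolute Pearson correlation purely to make the structure of the computation concrete. The function name and interface are assumptions.

```python
import numpy as np

def spectral_association_matrix(cepstra):
    """Pairwise association between spectral-feature dimensions.

    cepstra: (num_frames, num_dims) array of e.g. mel-cepstra.
    Returns a (num_dims, num_dims) matrix of absolute Pearson
    correlations -- a simplified stand-in for the MIC used in the paper.
    """
    return np.abs(np.corrcoef(cepstra, rowvar=False))
```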
As described in this paper, we propose a sub-band speech synthesis approach to develop a high-quality Text-to-Speech (TTS) system: a sample-based spectrum is used in the high-frequency band and the spectrum generated by HMM-based TTS is used in the low-frequency band. Herein, a sample-based spectrum means a spectrum selected from a phoneme database such that it is …
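The band-splicing step described above can be sketched as a per-frame combination of two magnitude spectra at a boundary bin. The function name, the hard cutoff, and the assumption of equal-length spectra are illustrative; this excerpt of the abstract does not specify the exact splicing procedure.

```python
import numpy as np

def combine_subband_spectra(hmm_spectrum, sample_spectrum, cutoff_bin):
    """Splice a sample-based high band onto an HMM-generated low band.

    Both inputs are magnitude spectra of one frame with the same number
    of bins; cutoff_bin marks the assumed band boundary. Bins below the
    cutoff come from the HMM-based TTS spectrum, bins at or above it
    from the sample-based spectrum.
    """
    hmm_spectrum = np.asarray(hmm_spectrum, dtype=float)
    sample_spectrum = np.asarray(sample_spectrum, dtype=float)
    assert hmm_spectrum.shape == sample_spectrum.shape
    out = hmm_spectrum.copy()
    out[cutoff_bin:] = sample_spectrum[cutoff_bin:]
    return out
```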