Masami Akamine

This paper provides a new method for automatically generating speech synthesis units. The algorithm, called Closed-Loop Training (CLT), is based on evaluating and reducing the distortion in synthesized speech. It minimizes, in an analytic way, distortion caused by the synthesis process, such as prosodic modification. The distortion is measured by calculating the …
Audiobook data is a freely available source of rich expressive speech data. To accurately generate speech of this form, expressiveness must be incorporated into the synthesis system. This paper investigates two parts of this process: the representation of expressive information in a statistical parametric speech synthesis system; and whether discrete …
A new parameter estimation method for Model-Based Feature Enhancement (MBFE) is presented. The conventional MBFE uses a vector Taylor series to calculate the parameters of non-linearly transformed distributions, but this linearization degrades performance. We use the unscented transformation to estimate the parameters, where a minimal …
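As background, the unscented transformation mentioned in the abstract above can be illustrated with a minimal one-dimensional sketch. This is an illustrative toy, not the paper's actual MBFE estimator; the function and parameter names are hypothetical, and a real system would operate on multivariate Gaussians.

```python
import math

def unscented_transform_1d(mu, var, f, kappa=2.0):
    """Propagate a 1-D Gaussian N(mu, var) through a nonlinearity f
    using the unscented transformation with 3 sigma points.

    Returns the estimated mean and variance of f(x).
    """
    n = 1                                   # dimensionality (scalar case)
    lam = kappa                             # simple scaling choice for the sketch
    spread = math.sqrt((n + lam) * var)     # distance of outer sigma points
    points = [mu, mu + spread, mu - spread]
    w0 = lam / (n + lam)                    # weight of the central point
    wi = 1.0 / (2.0 * (n + lam))            # weight of each outer point
    weights = [w0, wi, wi]

    ys = [f(x) for x in points]             # deterministic propagation
    mean = sum(w * y for w, y in zip(weights, ys))
    variance = sum(w * (y - mean) ** 2 for w, y in zip(weights, ys))
    return mean, variance
```

For a linear function the transformation is exact, e.g. propagating N(3, 4) through f(x) = 2x + 1 yields mean 7 and variance 16, which is one reason it avoids the linearization error of a first-order Taylor expansion.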
CELP coders that use pulse codebooks for excitation, such as ACELP [1], have the advantages of low complexity and high speech quality. At low bit rates, however, the reduced number of pulse position candidates and pulses degrades reconstructed speech quality. This paper describes a method for adaptively allocating pulse position candidates. In the …
This paper presents a summary of our research progress using decision-tree acoustic models (DTAMs) for large vocabulary speech recognition. Various configurations for training DTAMs are proposed and evaluated on the Wall Street Journal (WSJ) task. A number of different acoustic and categorical features have been used for this purpose. Various ways of realizing a …
This paper presents a novel approach to factorizing and controlling different speech factors in HMM-based TTS systems. Cluster adaptive training (CAT) is used to factorize speaker identity and expressiveness (i.e. emotion). Within a CAT framework, each speech factor can be modelled by a different set of clusters. Users can control speaker identity …
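The core operation in cluster adaptive training is simple: the model mean is an interpolation of cluster mean vectors under a set of interpolation weights. A minimal sketch of that step (illustrative only; the helper name and plain-list representation are assumptions, not the paper's implementation):

```python
def cat_mean(cluster_means, weights):
    """Compute the CAT-interpolated mean: mu = sum_c lambda_c * mu_c.

    cluster_means: list of per-cluster mean vectors (lists of floats).
    weights:       one interpolation weight lambda_c per cluster.
    """
    dim = len(cluster_means[0])
    return [sum(w * m[d] for w, m in zip(weights, cluster_means))
            for d in range(dim)]
```

With one weight vector per factor (e.g. one for speaker identity, one for emotion), adjusting the weights moves the synthesized voice continuously within the space spanned by the clusters, which is what gives users control over each factor independently.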
Statistical parametric synthesizers usually rely on a simplified model of speech production in which a minimum-phase filter is driven by a zero- or random-phase excitation signal. However, this procedure does not take into account the natural mixed-phase characteristics of the speech signal. This paper addresses this issue by proposing the use of the complex …
The most reliable way to build synthetic voices for end products is to start with high-quality recordings from professional voice talents. This paper describes the application of average voice models (AVMs) and a novel application of cluster adaptive training (CAT) to combine a small number of these high-quality corpora to make best use of them and improve …
Voice Activity Detection (VAD) refers to the problem of distinguishing speech segments from background noise. Numerous approaches have been proposed for this purpose. Some are based on features derived from the power spectral density; others exploit the periodicity of the signal. The goal of this letter is to investigate the joint use of source and …
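To make the VAD problem concrete, the simplest energy-based baseline classifies each frame by comparing its short-time energy with a threshold. This is a generic textbook sketch, not the letter's proposed method; the frame length and threshold values are arbitrary assumptions.

```python
def energy_vad(samples, frame_len=160, threshold=0.01):
    """Frame-level energy VAD: True = speech, False = background noise.

    samples:   sequence of floating-point audio samples.
    frame_len: samples per analysis frame (160 = 20 ms at 8 kHz).
    threshold: mean-square energy above which a frame counts as speech.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len  # mean-square energy
        decisions.append(energy > threshold)
    return decisions
```

More robust detectors replace the raw energy feature with spectral or periodicity cues and adapt the threshold to the noise floor, which is why combining complementary features, as the letter above investigates, tends to help in low-SNR conditions.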