Modeling DCT parameterized F0 trajectory at intonation phrase level with DNN or decision tree
In this paper, we describe research on fundamental frequency (F0) modeling based on a statistical learning technique called additive models. A two-layer additive F0 model consists of a long-term, intonational phrase-level component and a short-term, accentual phrase-level component. The model can be learned from data with a backfitting algorithm, which optimizes a penalized least-squares criterion defined on the model. The algorithm estimates the two components jointly by iteratively applying cubic spline smoothers. To investigate the flexibility of the model further, we incorporated a third additive term that represents a contextual effect on an accentual phrase, and confirmed that it reduces RMS error. Experimental results on a 7,000-utterance Japanese speech corpus show F0 RMS errors of 28.5 Hz and 29.3 Hz on the training and test data, respectively, with corresponding correlation coefficients of 0.81 and 0.79.
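The backfitting procedure described above can be illustrated with a minimal sketch, under several assumptions that do not come from the paper. The predictors `x_long` and `x_short` are hypothetical normalized time positions within the intonational phrase and the accentual phrase. The smoother is a penalized cubic regression spline (truncated-power basis with a ridge penalty on the knot coefficients), swapped in as a simple stand-in for a cubic smoothing spline. The data are synthetic, and all function names are illustrative.

```python
import math
import random

def cubic_basis(x, knots):
    """Truncated-power cubic spline basis evaluated at a scalar x."""
    return [1.0, x, x * x, x ** 3] + [max(x - k, 0.0) ** 3 for k in knots]

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def spline_smooth(xs, rs, knots, lam):
    """Penalized least-squares cubic spline fit of rs on xs; returns fitted values."""
    rows = [cubic_basis(x, knots) for x in xs]
    p = len(rows[0])
    # Normal equations (B^T B + lam * D) beta = B^T r; D penalizes knot terms only.
    a = [[sum(row[i] * row[j] for row in rows) for j in range(p)] for i in range(p)]
    for i in range(4, p):
        a[i][i] += lam
    b = [sum(row[i] * r for row, r in zip(rows, rs)) for i in range(p)]
    beta = solve(a, b)
    return [sum(c * v for c, v in zip(beta, row)) for row in rows]

def center(f):
    """Subtract the mean so that the intercept absorbs the overall level."""
    mean = sum(f) / len(f)
    return [v - mean for v in f]

def backfit(x_long, x_short, y, knots, lam=1e-3, n_iter=20, tol=1e-6):
    """Two-component backfitting: smooth each partial residual in turn."""
    n = len(y)
    alpha = sum(y) / n
    f_long, f_short = [0.0] * n, [0.0] * n
    for _ in range(n_iter):
        resid = [y[i] - alpha - f_short[i] for i in range(n)]
        new_long = center(spline_smooth(x_long, resid, knots, lam))
        resid = [y[i] - alpha - new_long[i] for i in range(n)]
        new_short = center(spline_smooth(x_short, resid, knots, lam))
        change = max(abs(u - v) for u, v in
                     zip(new_long + new_short, f_long + f_short))
        f_long, f_short = new_long, new_short
        if change < tol:
            break
    return alpha, f_long, f_short

def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

# Synthetic demo: made-up F0-like values in Hz, not the paper's corpus.
random.seed(0)
n = 200
x_long = [random.random() for _ in range(n)]    # position in intonational phrase
x_short = [random.random() for _ in range(n)]   # position in accentual phrase
y = [150.0 - 40.0 * a + 25.0 * math.sin(math.pi * b) + random.gauss(0.0, 5.0)
     for a, b in zip(x_long, x_short)]

alpha, f_long, f_short = backfit(x_long, x_short, y, knots=[0.2, 0.4, 0.6, 0.8])
pred = [alpha + u + v for u, v in zip(f_long, f_short)]
fit_rmse = rmse(pred, y)
base_rmse = rmse([alpha] * n, y)
print("additive model RMSE:", fit_rmse, "constant baseline RMSE:", base_rmse)
```

Each pass refits one component to the residual left by the other, so the long-term declination-like trend and the short-term accent-like bump are separated without being fit in a single joint regression. A third additive term, such as the contextual effect mentioned above, would enter as one more smoothing step inside the same loop.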