Articulatory Synthesis of Fricative Consonants: Data and Models

Abstract

The present work aims at demonstrating the feasibility of high-quality articulatory synthesis for fricative consonants, and in particular the possibility of matching a given reference subject. The synthesiser includes an articulatory model based on cineradiographic pictures of the subject, and a simplified aerodynamic model. Two approaches have been used: direct articulatory copy synthesis, and copy synthesis by acoustic-to-articulatory inversion. Coordination between supralaryngeal and laryngeal articulators has been quasi-automatically determined, based on supplementary aerodynamic data. A set of VFV spatiotemporal exemplars has finally been built, and should serve to establish sensory-motor templates for synthesis.

Introduction

We believe that articulatory synthesis is a promising approach to speech synthesis, because its anthropomorphic nature allows the synthesis strategies to be adapted, in a coherent fashion, to the environmental conditions. The present work thus aimed at demonstrating the feasibility of high-quality articulatory synthesis for fricative consonants, and in particular the possibility of matching a given reference subject. This study relies on two complementary approaches, namely direct articulatory copy synthesis and inversion. It involves articulatory-aerodynamic-acoustic data on the one hand, and relevant models on the other hand.

1. The articulatory-aerodynamic-acoustic data and the articulatory synthesiser

A reference subject uttered the same small corpus of French vowels, and VCV sequences of voiced plosives and fricatives, in different setup conditions (cf. Badin et al., 1995a, b). Midsagittal contours were obtained by cineradiography, in synchrony with front views of the lips recorded by video. The low-frequency components of both the volume velocity at the lips U and the intra-oral pressure ∆Pc were recorded in a different session by means of a Rothenberg mask, and the minimal oral constriction area Ac_aero was determined by the orifice equation.
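As an illustration of how a constriction area can be derived from the low-frequency flow and pressure signals, the following minimal sketch applies the orifice equation in its usual form, ∆P = k ρ U² / (2 A²), solved for A. It is not the authors' implementation: the function name, the loss coefficient k = 1, the air density value and the unit conventions (cm³/s for flow, cm H2O for pressure) are assumptions made for the example.

```python
import math

RHO_AIR = 1.14e-3        # g/cm^3, assumed density of warm, humid air
DYN_PER_CMH2O = 980.665  # dyn/cm^2 per cm H2O (unit conversion)


def orifice_area(flow_cm3_s, pressure_drop_cmh2o, k=1.0, rho=RHO_AIR):
    """Constriction area in cm^2 from the orifice equation.

    Solves dP = k * rho * U^2 / (2 * A^2) for A.
    k is an assumed empirical loss coefficient (1.0 = ideal orifice).
    """
    if pressure_drop_cmh2o <= 0.0:
        raise ValueError("pressure drop must be positive")
    dp = pressure_drop_cmh2o * DYN_PER_CMH2O  # convert to dyn/cm^2
    return flow_cm3_s * math.sqrt(k * rho / (2.0 * dp))


if __name__ == "__main__":
    # Hypothetical fricative-like values: 300 cm^3/s through a 6 cm H2O drop.
    print(f"{orifice_area(300.0, 6.0):.3f} cm^2")
```

In practice the estimate is only meaningful where the pressure drop is well above the measurement noise, which is why the sketch rejects non-positive pressure drops rather than returning an area.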
Formant trajectories were also determined by carefully hand-editing poles extracted from LPC coefficients.

Bergame, the ICP articulatory synthesiser, was developed based on these data (Beautemps et al., 1996). The first module is a physiologically-oriented statistical articulatory model, basically driven by eight parameters: jaw height JH, lip height LH and protrusion LP, tongue advance TA, body TB, dorsum TD and tip TT, and larynx height LY. The second module is a model of passage from the midsagittal function to the area function, also optimised on the same data (Beautemps et al., 1996). Finally, the resulting sound is produced by a time-domain reflection-type line analogue (Bailly et al., 1994), excited by an improved two-mass model of the vocal folds (Vescovi et al., 1995), and a newly developed noise source for fricatives (Badin et al., 1995b). The noise source is controlled by the low-frequency component of the pressure drop ∆Pc at the oral constriction and by the aerodynamically equivalent constriction area Ac (either the minimal constriction area in the tract Acl [excluding the larynx and the lips], or the lip area Al).

1st ESCA Tutorial and Research Workshop on Speech Production Modeling – 4th Speech Production Seminar

A simplified aerodynamic model, valid at low frequencies (below approximately 100 Hz), considers the vocal tract as two constrictions: the glottis and the oral constriction. Bernoulli and Poiseuille equations are used to express ∆Pc as a function of Ac, and the pressure drop ∆Pg across the glottis as a function of Ag, where Ag is the low-frequency component of the glottal area determined in the two-mass model. The subglottal pressure Ps is then equal to the sum of ∆Pg and ∆Pc.

The articulatory synthesiser is globally controlled by two sets of articulatory parameters that need to be carefully coordinated: supralaryngeal parameters (i.e. the command parameters of the articulatory model), and laryngeal parameters controlling the vocal folds (subglottal pressure Ps, vocal fold length LG, glottis rest height H0). The aim of the present study being to replicate natural VFV fricative sequences, two main strategies were explored for the supralaryngeal articulators: direct articulatory copy synthesis, and acoustic-to-articulatory inversion. These approaches are described in the following sections, as well as the method employed to establish the supralaryngeal–laryngeal coordination.

2. Direct articulatory copy synthesis

This strategy consisted of mimicking the subject's articulation as closely as possible by direct measurements. Five of the parameters were thus directly measured on the contours: JH, LH, LP, TA, and LY. The other three tongue parameters, TB, TD and TT, were obtained by a pseudo-inversion of the matrix that predicts the coordinates of the tongue contour as linear combinations of them (Badin et al., 1995a). Finally, midsagittal profiles, and then area functions, were computed from these parameters using the articulatory model.

This strategy is limited to the resynthesis of the items of the initial corpus (8 vowels, 3 voiced fricatives in 6 vocalic contexts). It serves the purpose of assessing how close the whole model chain is to the reference subject. An evaluation can be found in Beautemps et al. (1996). In particular, the root-mean-square errors on the formants are 49, 130, 145 and 200 Hz for F1, F2, F3 and F4 respectively, which is quite a reasonable fit. This shows that the articulatory synthesiser fits the characteristics of the reference subject fairly well and provides a good basis for further studies.

3. Copy synthesis by acoustic-to-articulatory inversion

We resorted to an inversion method in order to overcome the limitations of direct copy synthesis, and to be able to mimic any sequence for which only the radiated sound would be available (possibly supplemented by lip parameters and aerodynamic measurements). The articulatory parameters were thus determined from measured formant trajectories, and from the specification of some geometrical parameters, by means of a classical gradient descent method (Jordan, 1990): the algorithm aims at minimising the distance between the predicted and the target distal parameters (formants and geometric parameters) by finding the best proximal parameters (the command parameters). A forward model of the articulatory model was thus established: each of the four formant frequencies, in a dictionary produced with the direct model, was modelled by a separate fourth-order polynomial function of the eight articulatory parameters (Morris, 1992); the same was done for the two geometrical parameters, i.e. Acl and Al.

The error to minimise is the weighted sum of (1) the quadratic distances between the six distal parameters and the measured parameters, for all the frames in the sequence, and (2) the jerk of the proximal parameters. Formants are expressed in Bark, and areas are saturated by arctangent functions. The distance is actually a parabolic function on each side of a don't-care range, limited by minimum and maximum target values, within which it is set to zero.

As speech production simultaneously involves different spaces, i.e. the articulatory, geometric, aerodynamic and acoustic spaces, a trend toward a multilayered representation of speech is developing (cf. Bailly, 1996). In particular, it is clear that vowels are more precisely and economically represented in terms of formants, whereas consonants are better represented in terms of place and degree of constriction/closure.
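The error criterion described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and data layout are hypothetical, the jerk is approximated by third-order finite differences across frames, and the Bark conversion, the arctangent saturation of areas, the polynomial forward model and the gradient descent itself are all left out.

```python
def dont_care_distance(value, lo, hi):
    """Zero inside the don't-care range [lo, hi], parabolic outside it."""
    if value < lo:
        return (lo - value) ** 2
    if value > hi:
        return (value - hi) ** 2
    return 0.0


def jerk_penalty(track):
    """Sum of squared third-order finite differences of one parameter track."""
    return sum(
        (track[i + 3] - 3 * track[i + 2] + 3 * track[i + 1] - track[i]) ** 2
        for i in range(len(track) - 3)
    )


def inversion_error(predicted, targets, weights, proximal_tracks, jerk_weight):
    """Weighted fit error over all frames plus a smoothness (jerk) term.

    predicted[frame][k]  : distal parameter k predicted by the forward model
    targets[frame][k]    : (lo, hi) don't-care range for that parameter
    weights[k]           : weight of distal parameter k (assumed per-parameter)
    proximal_tracks[j]   : time series of command parameter j over the frames
    """
    fit = sum(
        w * dont_care_distance(p, lo, hi)
        for frame_pred, frame_tgt in zip(predicted, targets)
        for p, (lo, hi), w in zip(frame_pred, frame_tgt, weights)
    )
    smooth = sum(jerk_penalty(track) for track in proximal_tracks)
    return fit + jerk_weight * smooth
```

A wide range makes a parameter effectively unconstrained for a given frame, while a narrow range pins it down; the jerk term then favours smooth command trajectories among all those that satisfy the ranges.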
In line with this multilayered view, our inversion procedure specified vowels in terms of formants, leaving the Acl and Al parameters practically unspecified. The fricatives, on the other hand, were coded in terms of degree of constriction: the upper limit of Acl or Al was set to a high value (typically 1-5 cm²) for vowels, and to 0.15 cm² for fricatives, while the lower limit was set to 0.05 cm² in order to avoid complete closure. Boundaries between vowels and fricatives were determined from the sound pressure level at the lips by appropriate thresholding, using the fact that the energy of the vowels is much higher than that of the consonants. The transitions

Cite this paper

@inproceedings{Badin1996ArticulatorySO, title={Articulatory Synthesis of Fricative Consonants: Data and Models}, author={Parvin Badin and Khaled Mawass and Gilles Bailly and Christophe Vescovi and Denis Beautemps and Xavier Pelorson}, year={1996} }