Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory

@article{Toda2007VoiceCB,
  title={Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory},
  author={Tomoki Toda and Alan W. Black and Keiichi Tokuda},
  journal={IEEE Transactions on Audio, Speech, and Language Processing},
  year={2007},
  volume={15},
  pages={2222--2235}
}
  • T. Toda, A. Black, K. Tokuda
  • Published 1 November 2007
  • Computer Science, Mathematics
  • IEEE Transactions on Audio, Speech, and Language Processing
In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not…
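The conventional frame-by-frame conversion mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: features are scalar (one-dimensional) rather than spectral vectors, and the names `gaussian_pdf`, `mmse_convert`, `convert_sequence`, and the toy mixture `TOY_GMM` are hypothetical. For each source frame x, the sketch computes the posterior probability of each joint-GMM component and returns the posterior-weighted conditional mean E[y | x], which is the minimum mean square error estimate of the target frame.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mmse_convert(x, components):
    """Map one source frame x to a target frame via the conditional mean E[y | x].

    components: list of dicts, one per joint-GMM mixture component, with keys
      w (mixture weight), mu_x, mu_y (source/target means),
      var_x (source variance), cov_yx (target-source cross-covariance).
    """
    # Posterior probability of each component given the source frame only.
    likes = [c["w"] * gaussian_pdf(x, c["mu_x"], c["var_x"]) for c in components]
    total = sum(likes)
    posts = [like / total for like in likes]
    # Per-component conditional mean of y given x, weighted by the posterior.
    return sum(
        p * (c["mu_y"] + c["cov_yx"] / c["var_x"] * (x - c["mu_x"]))
        for p, c in zip(posts, components)
    )


# Toy two-component joint GMM (hypothetical values, for illustration only).
TOY_GMM = [
    {"w": 0.5, "mu_x": -1.0, "mu_y": -2.0, "var_x": 1.0, "cov_yx": 0.5},
    {"w": 0.5, "mu_x": 1.0, "mu_y": 2.0, "var_x": 1.0, "cov_yx": 0.5},
]


def convert_sequence(frames, components=TOY_GMM):
    """Frame-by-frame conversion: each frame is mapped independently of its neighbours."""
    return [mmse_convert(x, components) for x in frames]
```

Calling `convert_sequence([-0.7, 0.0, 0.7])` converts three frames one at a time. No information is shared between frames in this sketch, which is what "frame by frame" refers to in the abstract.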
Modulation spectrum-constrained trajectory training algorithm for GMM-based Voice Conversion
TLDR
A novel training algorithm for Gaussian mixture model (GMM)-based voice conversion that uses a consistent optimization criterion for training and conversion and compensates the modulation spectrum of the converted parameter trajectory, a feature closely correlated with the over-smoothing effects that degrade the quality of the converted speech.
Modulation spectrum-based post-filter for GMM-based Voice Conversion
TLDR
The over-smoothing effect in Gaussian mixture model (GMM)-based voice conversion (VC) is addressed: the modulation spectrum (MS) of speech parameters is degraded by the GMM-based conversion process, and the proposed MS-based post-filter (MSPF) is applied.
Exemplar-based voice conversion using non-negative spectrogram deconvolution
TLDR
Experiments indicate that the proposed voice conversion system based on non-negative spectrogram deconvolution outperforms the conventional joint density Gaussian mixture model by a wide margin in terms of both objective and subjective evaluations.
A Revisit to Feature Handling for High-quality Voice Conversion Based on Gaussian Mixture Model
TLDR
An alternative filtering method named SP-WORLD, inspired by the WORLD vocoder framework, is introduced; subjective experiments demonstrate that SP-WORLD is comparable to MLSA filtering and outperforms it in some cases.
Voice Conversion Based on Trajectory Model Training of Neural Networks Considering Global Variance
TLDR
A consistent framework using the same criterion for both training and synthesis provides better conversion accuracy in the original static feature domain, and the over-smoothing can be avoided by optimizing the DNN parameters on the basis of the trajectory likelihood considering the GV.
Modulation spectrum-constrained trajectory training algorithm for HMM-based speech synthesis
TLDR
A novel training algorithm for hidden Markov model (HMM)-based speech synthesis that improves synthetic speech quality while preserving computationally efficient generation processing.
Voice conversion based on Gaussian processes by using kernels modeling the spectral density with Gaussian mixture models
  • J. Bao, N. Xu
  • Computer Science
  • Modern Physics Letters B
  • 2018
TLDR
This paper attempts to improve the flexibility of GP-based VC by using expressive kernels derived to model the spectral density with a Gaussian mixture model (GMM).
Incorporating global variance in the training phase of GMM-based voice conversion
TLDR
The proposed maximum likelihood-based trajectory mapping incorporating GV in the training phase of GMM-based VC provides converted speech quality comparable to that of MLGV-based trajectory mapping, with reduced computational cost in the conversion process.
A Statistical Sample-Based Approach to GMM-Based Voice Conversion Using Tied-Covariance Acoustic Models
TLDR
Because the proposed method utilizes individual speech features and its formulation is the same as that of conventional GMM-based VC, it can produce high-quality speech while keeping the flexibility of the original GMM-based VC.
Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data
TLDR
To further improve the performance of the GP-based method, a strategy for mapping prosodic and spectral features coherently is adopted, making the best use of the intercorrelations embedded among both excitation and vocal tract features.

References

SHOWING 1-10 OF 59 REFERENCES
Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter
  • T. Toda, A. Black, K. Tokuda
  • Mathematics, Computer Science
  • Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.
  • 2005
TLDR
Experimental results show that the performance of the voice conversion can be improved by using the global variance information, and it is demonstrated that the proposed algorithm is more effective than spectral enhancement by postfiltering.
Continuous probabilistic transform for voice conversion
TLDR
The design of a new methodology for representing the relationship between two sets of spectral envelopes is described; the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previously proposed conversion methods.
Voice conversion algorithm based on Gaussian mixture model with dynamic frequency warping of STRAIGHT spectrum
  • T. Toda, H. Saruwatari, K. Shikano
  • Computer Science
  • 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221)
  • 2001
TLDR
Results of the evaluation experiments clarify that the converted speech quality is better than that of the GMM-based algorithm, and the conversion accuracy on speaker individuality is the same as that of this proposed method with the properly weighted residual spectrum.
A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis
TLDR
A generation algorithm that considers not only the HMM likelihood maximized in the conventional algorithm but also a likelihood for the global variance of the generated trajectory; the latter works as a penalty for over-smoothing.
Spectral voice conversion for text-to-speech synthesis
  • A. Kain, Michael W. Macon
  • Computer Science
  • Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
  • 1998
TLDR
A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented and is found to perform more reliably for small training sets than a previous approach.
Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction
  • A. Kain, Michael W. Macon
  • Computer Science
  • 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221)
  • 2001
TLDR
Results show that the speaker identity of speech whose LPC spectrum has been converted can be recognized as the target speaker with the same level of performance as discriminating between LPC-coded speech; however, the level of discrimination of converted utterances produced by the full VC system is significantly below that of speaker discrimination of natural speech.
Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model
TLDR
Experimental results demonstrate that the MLE-based mapping with dynamic features can significantly improve the mapping performance compared with the MMSE-based mapping in both the articulatory-to-acoustic mapping and the inversion mapping.
Mapping from articulatory movements to vocal tract spectrum with Gaussian mixture model for articulatory speech synthesis
TLDR
Experimental results show that MLE using both static and dynamic features can improve the mapping accuracy compared with the conventional GMM-based mapping.
Quality-enhanced voice morphing using maximum likelihood transformations
  • H. Ye, S. Young
  • Computer Science
  • IEEE Transactions on Audio, Speech, and Language Processing
  • 2006
TLDR
A general maximum likelihood framework is proposed for transform estimation which avoids the need for parallel training data inherent in conventional least mean square approaches and shows that the proposed approaches are capable of effectively transforming speaker identity whilst maintaining high quality.
Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks
TLDR
A speech spectrum transformation method that interpolates multiple speakers' spectral patterns and uses a multi-functional representation with radial basis function networks to generate new spectrum patterns close to those of the target speaker.