Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion

@inproceedings{Huang2019InvestigationOF,
  title={Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion},
  author={Wen-Chin Huang and Yi-Chiao Wu and Chen-Chou Lo and Patrick Lumban Tobing and Tomoki Hayashi and Kazuhiro Kobayashi and Tomoki Toda and Yu Tsao and H. Wang},
  booktitle={INTERSPEECH},
  year={2019}
}
In this work, we investigate the effectiveness of two techniques for improving variational autoencoder (VAE) based voice conversion (VC). First, we reconsider the relationship among the vocoder features extracted using the high-quality vocoders adopted in conventional VC systems, and hypothesize that the spectral features are in fact F0 dependent. Such a hypothesis implies that during the conversion phase, the latent codes and the converted features in VAE-based VC are in fact source F0 dependent… 
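The F0-conditioning idea described above can be sketched as follows. This is a minimal, hypothetical illustration (the function names, linear encoder/decoder, and dimensions are assumptions for exposition, not the paper's actual network): the decoder receives the target log-F0 and a speaker code alongside the latent code, so the latent space is not forced to absorb source-F0 information.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(spectral_frame, W_enc):
    # Hypothetical encoder: project one spectral feature frame to a latent code.
    return np.tanh(W_enc @ spectral_frame)

def decode(latent, log_f0, speaker_code, W_dec):
    # F0-conditioned decoder: concatenate the latent code with the target
    # log-F0 and a speaker code before reconstruction, so F0 information
    # need not be carried by the latent code itself.
    cond = np.concatenate([latent, [log_f0], speaker_code])
    return W_dec @ cond

spec_dim, latent_dim, spk_dim = 40, 16, 4
W_enc = rng.standard_normal((latent_dim, spec_dim))
W_dec = rng.standard_normal((spec_dim, latent_dim + 1 + spk_dim))

frame = rng.standard_normal(spec_dim)
z = encode(frame, W_enc)
# Conversion: reuse the source latent code, but swap in the target speaker's
# code and a converted (e.g. mean/variance-transformed) log-F0 value.
converted = decode(z, log_f0=5.1, speaker_code=np.eye(spk_dim)[1], W_dec=W_dec)
```

In an actual system the encoder/decoder would be trained networks over vocoder spectral features; the point here is only where the F0 conditioning enters the decoder.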

Citations

Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion
TLDR
This article extends the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech.
Vaw-Gan For Disentanglement And Recomposition Of Emotional Elements In Speech
TLDR
This paper proposes a speaker-dependent EVC framework based on VAW-GAN that includes a spectral encoder that disentangles emotion and prosody (F0) information from spectral features, and a prosodic encoder that disentangles the emotion modulation of prosody from linguistic prosody.
An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation
TLDR
A novel StarGAN framework with a two-stage training process that separates emotional features from emotion-independent ones, using an autoencoder with two encoders as the generator of the Generative Adversarial Network (GAN); experiments reveal that the proposed model can effectively reduce distortion.
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data
TLDR
This paper proposes a singing voice conversion framework that is based on VAW-GAN, and trains an encoder to disentangle singer identity and singing prosody from phonetic content and achieves better performance than the baseline frameworks.
DEEPA: A Deep Neural Analyzer for Speech and Singing Vocoding
TLDR
A deep neural analyzer is proposed, denoted as DeepA – a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the input speech that emulate those defined in conventional vocoders, and the resulting parameters are more interpretable than other latent neural representations.
MoEVC: A Mixture of Experts Voice Conversion System With Sparse Gating Mechanism for Online Computation Acceleration
TLDR
A novel mixture-of-experts (MoE) based VC system that can skip some convolution processes through elimination of redundant feature maps, thereby accelerating online computing and improving VC performance in both objective evaluation and human subjective listening tests is proposed.
MoEVC: A Mixture-of-experts Voice Conversion System with Sparse Gating Mechanism for Accelerating Online Computation
TLDR
Experimental results show that by specifying suitable sparse constraints, a novel mixture-of-experts (MoE) based VC system can effectively increase online computation efficiency with a notable 70% reduction in FLOPs (floating-point operations) while improving VC performance in both objective evaluations and human listening tests.
Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion
TLDR
A modification of spectrum-differential-based direct waveform modification for voice conversion (DIFFVC) is presented so that it can be applied directly as a waveform-generation module to voice conversion models and generalized to any spectral conversion model.
Submission from SRCB for Voice Conversion Challenge 2020
TLDR
This work focuses on building a voice conversion system achieving consistent improvements in accent and intelligibility evaluations, and extracts general phonation from the source speakers' speeches of different languages, and improves the sound quality by optimizing the speech synthesis module and adding a noise suppression post-process module to the vocoder.
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion
TLDR
This paper proposes a speaker-independent emotional voice conversion framework, that can convert anyone's emotion without the need for parallel data, and proposes a VAW-GAN based encoder-decoder structure to learn the spectrum and prosody mapping.

References

SHOWING 1-10 OF 30 REFERENCES
Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders
TLDR
Experimental results demonstrate that the proposed CD-VAE framework outperforms the conventional VAE framework in terms of subjective tests and also improves the capability of VAE for VC.
ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder
TLDR
This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE), which adopts fully convolutional architectures and avoids producing buzzy-sounding speech at test time by simply transplanting the spectral details of the input speech into its converted version.
ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion
TLDR
A voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech.
Voice Conversion Using Deep Neural Networks With Layer-Wise Generative Training
TLDR
A DNN is used to construct a global non-linear mapping relationship between the spectral envelopes of two speakers to significantly improve the performance in terms of both similarity and naturalness compared to conventional methods.
Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks
TLDR
This paper proposes a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model.
Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks
TLDR
The proposed SVC framework uses a similarity metric implicitly derived from a generative adversarial network, enabling the measurement of the distance in the high-level abstract space to mitigate the oversmoothing problem caused in the low-level data space.
Locally Linear Embedding for Exemplar-Based Spectral Conversion
TLDR
The results of subjective evaluation conducted by the VCC2016 organizer show that the LLE exemplar-based SC system notably outperforms the baseline GMM-based system and the three additional approaches with improved speech quality.
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
TLDR
An adversarial learning framework for voice conversion is proposed, with which a single model can be trained to convert the voice to many different speakers, all without parallel data, by separating the speaker characteristics from the linguistic content in speech signals.
The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods
TLDR
A brief summary of the state-of-the-art techniques for VC is presented, followed by a detailed explanation of the challenge tasks and the results that were obtained.
Sequence-to-Sequence Acoustic Modeling for Voice Conversion
TLDR
Experimental results show that the proposed neural network named sequence-to-sequence ConvErsion NeTwork (SCENT) obtained better objective and subjective performance than the baseline methods using Gaussian mixture models and deep neural networks as acoustic models.
...