Corpus ID: 239049661

Identity Conversion for Emotional Speakers: A Study for Disentanglement of Emotion Style and Speaker Identity

Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li
Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and speaker-dependent emotion style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the speaker-dependent emotion style for expressive voice conversion. Motivated by the recent success of speaker disentanglement with the variational autoencoder (VAE), we propose an expressive voice conversion framework which can effectively disentangle… 
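The abstract describes disentangling speaker identity from emotion style with a VAE. A minimal sketch of the core mechanics, with toy linear "encoders", a shared decoder, and the reparameterization trick, follows; all dimensions, weight matrices, and the two-encoder split are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W, z_dim):
    """Toy linear 'encoder' producing mean and log-variance of a latent."""
    h = x @ W
    return h[:z_dim], h[z_dim:]  # mu, log_var

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical sizes: an 80-dim spectral frame, 8-dim latents.
feat_dim, z_dim = 80, 8
x = rng.standard_normal(feat_dim)

# Assumed split: one encoder for speaker identity, one for emotion style.
W_spk = rng.standard_normal((feat_dim, 2 * z_dim)) * 0.1
W_emo = rng.standard_normal((feat_dim, 2 * z_dim)) * 0.1

z_spk = reparameterize(*encode(x, W_spk, z_dim))
z_emo = reparameterize(*encode(x, W_emo, z_dim))

# Conversion: keep the source emotion latent, swap in a target speaker latent,
# then decode the recombined code back to the feature space.
z_spk_target = rng.standard_normal(z_dim)
W_dec = rng.standard_normal((2 * z_dim, feat_dim)) * 0.1
x_converted = np.concatenate([z_spk_target, z_emo]) @ W_dec
print(x_converted.shape)  # (80,)
```

The key idea the sketch isolates is that once the two latents are separated, identity conversion is just replacing `z_spk` before decoding, leaving the emotion code untouched.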



Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer
This paper proposes a StarGAN-based framework to learn a many-to-many mapping across different speakers that takes into account speaker-dependent emotional style without the need for parallel data; it is the first study on expressive voice conversion.
Vaw-Gan For Disentanglement And Recomposition Of Emotional Elements In Speech
This paper proposes a speaker-dependent EVC framework based on VAW-GAN that includes a spectral encoder, which disentangles emotion and prosody (F0) information from spectral features, and a prosodic encoder, which disentangles emotion modulation of prosody from linguistic prosody.
Intra-class variation reduction of speaker representation in disentanglement framework
The proposed criteria reduce the variation in speaker characteristics caused by changes in background environment or spoken content; the resulting embeddings of each speaker become more consistent, and the effectiveness of the proposed method is demonstrated.
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion
One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement.
Again-VC: A One-Shot Voice Conversion Using Activation Guidance and Adaptive Instance Normalization
This work proposes AGAIN-VC, a VC system using Activation Guidance and Adaptive Instance Normalization; it is an auto-encoder-based model comprising a single encoder and a decoder.
Comparison of speaker dependent and speaker independent emotion recognition
Happiness and anger, as well as boredom and neutrality, proved to be the pairs of emotions most often confused, and emotion recognition in speaker dependent conditions usually yielded higher accuracy results than a similar but speaker independent configuration.
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
This paper proposes a novel one-shot VC approach which is able to perform VC using only one example utterance from the source and target speaker, respectively, and the source and target speakers do not even need to be seen during training.
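The instance-normalization idea behind this line of work is that per-utterance channel statistics carry speaker-level information, so normalizing them away leaves a more speaker-independent content representation. A minimal numpy sketch under that assumption (the shift/scale "speaker" model is purely illustrative):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel over time, removing per-utterance statistics."""
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(1)
# Two 'utterances' with identical content, differing only by a per-speaker
# global shift and scale (a toy stand-in for speaker characteristics).
content = rng.standard_normal((4, 100))  # channels x frames
speaker_a = 2.0 * content + 1.0
speaker_b = 0.5 * content - 3.0

norm_a = instance_norm(speaker_a)
norm_b = instance_norm(speaker_b)
print(np.allclose(norm_a, norm_b, atol=1e-4))  # True: speaker stats removed
```

In AdaIN-style decoders, the converse step re-injects a target speaker's statistics as the shift and scale applied after normalization.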
Voice conversion from non-parallel corpora using variational auto-encoder
A spectral conversion (SC) framework based on a variational auto-encoder is proposed; it enables the use of non-parallel corpora and removes the requirement of parallel corpora or phonetic alignments for training a spectral conversion system.
Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks
This paper proposes a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model.
Group Sparse Representation With WaveNet Vocoder Adaptation for Spectrum and Prosody Conversion
This paper uses phonetic posteriorgrams (PPGs) together with spectral and prosody features to form a tandem feature in the phonetic dictionary, which allows estimating an activation matrix that is less dependent on the source speaker, thus providing better voice conversion quality.