StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts

Matthias Baas and Herman Kamper
Voice conversion is the task of converting a spoken utterance from a source speaker so that it appears to be said by a different target speaker while retaining the linguistic content of the utterance. Recent advances have led to major improvements in the quality of voice conversion systems. However, to be useful in a wider range of contexts, voice conversion systems would need to (i) be trainable without access to parallel data, (ii) work in a zero-shot setting where both the source and target…

Voice Conversion Can Improve ASR in Very Low-Resource Settings

This work combines several recent techniques to design and train a practical VC system in English, and then uses this system to augment data for training speech recognition models in several low-resource languages.

Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models

A novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training is presented, achieving accuracy comparable to state-of-the-art methods without any additional training.

StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks

Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.

Zero-Shot Voice Style Transfer with Only Autoencoder Loss

A new style transfer scheme that involves only an autoencoder with a carefully designed bottleneck is proposed, which achieves state-of-the-art results in many-to-many voice conversion with non-parallel data and is the first to perform zero-shot voice conversion.
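The core idea of this bottleneck-autoencoder scheme can be illustrated with a shape-level sketch: content passes through a bottleneck too narrow to carry speaker identity, and the decoder is conditioned on a target-speaker embedding instead. All dimensions and the random linear weights below are hypothetical stand-ins for the paper's recurrent networks, chosen only to make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 80-dim mel frames,
# a 4-dim content bottleneck, and 16-dim speaker embeddings.
N_MELS, BOTTLENECK, D_SPK = 80, 4, 16

# Random linear maps stand in for the trained encoder/decoder networks.
W_enc = rng.standard_normal((N_MELS, BOTTLENECK)) * 0.1
W_dec = rng.standard_normal((BOTTLENECK + D_SPK, N_MELS)) * 0.1

def convert(mel_source, spk_embedding_target):
    """Encode source frames through the narrow bottleneck, then decode
    conditioned on the *target* speaker's embedding."""
    content = mel_source @ W_enc                           # (T, BOTTLENECK)
    T = content.shape[0]
    spk = np.tile(spk_embedding_target, (T, 1))            # one copy per frame
    return np.concatenate([content, spk], axis=1) @ W_dec  # (T, N_MELS)

mel = rng.standard_normal((100, N_MELS))  # 100 frames of source speech
spk = rng.standard_normal(D_SPK)          # embedding of an unseen target speaker
out = convert(mel, spk)
print(out.shape)  # (100, 80)
```

Because the speaker embedding can come from any speaker's enrollment audio, swapping it at decode time is what enables the zero-shot conversion the entry describes.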

The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

A brief summary of the state-of-the-art techniques for VC is presented, followed by a detailed explanation of the challenge tasks and the results that were obtained.

StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

This work rethinks the conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and proposes an improved variant called StarGAN-VC2 that introduces a modulation-based conditional method and improves speech quality in terms of both global and local structure measures.
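A modulation-based conditional method of the kind described here can be sketched as conditional instance normalisation: features are normalised over time, then rescaled and shifted with parameters predicted from the target-speaker code. The dimensions and random projection matrices below are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

D_FEAT, D_SPK = 64, 8  # assumed feature and speaker-code sizes

# Hypothetical projections mapping the speaker code to a per-channel
# scale (gamma) and shift (beta).
W_gamma = rng.standard_normal((D_SPK, D_FEAT)) * 0.1
W_beta = rng.standard_normal((D_SPK, D_FEAT)) * 0.1

def modulate(features, spk_code):
    """Normalise each channel over time, then modulate it with
    speaker-dependent scale and shift parameters."""
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True) + 1e-5
    normed = (features - mu) / sigma
    gamma = 1.0 + spk_code @ W_gamma  # scale, centred at 1
    beta = spk_code @ W_beta          # shift
    return normed * gamma + beta

feats = rng.standard_normal((100, D_FEAT))  # (time, channels)
spk = rng.standard_normal(D_SPK)            # target-speaker code
out = modulate(feats, spk)
```

The design point is that the speaker identity enters through feature statistics rather than through a one-hot channel concatenated to the input, which is what lets a single generator serve many domains.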

Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion

  • Yi Zhao, Wen-Chin Huang, T. Toda
  • Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020
From the results of crowd-sourced listening tests, it is observed that VC methods have progressed rapidly thanks to advanced deep learning methods, although the overall naturalness and similarity scores for the cross-lingual task were lower than those for the intra-lingual conversion task.

Defending Your Voice: Adversarial Attack on Voice Conversion

The first known attempt to perform an adversarial attack on voice conversion is reported; it introduces human-imperceptible noise into the utterances of a speaker whose voice is to be defended.

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

This article provides a comprehensive overview of the state-of-the-art voice conversion techniques and their performance evaluation methods, from statistical approaches to deep learning, and discusses their promise and limitations.

ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network

ConVoice can convert speech of any length without compromising quality due to its convolutional architecture, and has comparable quality to similar state-of-the-art models while being extremely fast.

ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder

This paper confirmed experimentally that the proposed method outperformed baseline non-parallel VC systems and performed comparably to an open-source parallel VC system trained using a parallel corpus in a speaker identity conversion task.

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion, and suggests a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks.
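The non-autoregressive, fully convolutional property the entry highlights can be shown at shape level: every mel frame is mapped to a fixed block of waveform samples in one parallel pass, with no sample-by-sample recurrence. The hop size and the single random projection below are assumptions standing in for MelGAN's stack of transposed convolutions and residual blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS, HOP = 80, 256  # assumed: samples generated per mel frame

# A random projection stands in for the trained convolutional generator.
W_proj = rng.standard_normal((N_MELS, HOP)) * 0.01

def vocode(mel):
    """Map every mel frame to HOP waveform samples in parallel: the
    whole utterance is generated in one pass, with no autoregression."""
    return (mel @ W_proj).reshape(-1)  # (T * HOP,) samples

mel = rng.standard_normal((40, N_MELS))  # 40 frames ≈ 0.46 s at 22.05 kHz
audio = vocode(mel)
print(audio.shape)  # (10240,)
```

Because nothing in the mapping depends on utterance length, the same generator handles speech of any duration, which is the property the entry attributes to the convolutional architecture.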