StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

Takuhiro Kaneko, H. Kameoka, Kou Tanaka, Nobukatsu Hojo
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. [...] To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2. Particularly, we rethink conditional methods in two aspects: training objectives and network architectures. For the former, we propose a source-and…


Boosting Star-GANs for Voice Conversion with Contrastive Discriminator

State-of-the-art contrastive learning techniques are used, and an efficient Siamese network structure is incorporated into the StarGAN discriminator to improve training stability and prevent discriminator overfitting during training.

Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks

Three formulations of StarGAN are described, including a newly introduced variant called “Augmented classifier StarGAN (A-StarGAN)”, and they are compared with each other and with several baseline methods in a nonparallel VC task.

Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN

An improved method named “PSR-StarGAN-VC” is proposed, which uses a variant of generative adversarial networks to perform non-parallel many-to-many VC and is shown to be superior in terms of naturalness and speaker similarity.

Non-Parallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks

Three formulations of StarGAN are described, including a newly introduced variant called "Augmented classifier StarGAN (A-StarGAN)", and they are compared in a non-parallel VC task.

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that the StarGAN v2 model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech based voice conversion methods, without the need for text labels.

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

This paper proposes a new voice conversion framework, i.e. Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC), which converts each subband content of the source speech separately by explicitly utilizing the spatial characteristics between different subbands.

StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts

StarGAN-ZSVC is extended, showing that real-time zero-shot voice conversion is possible even for a model trained on very little data, and comparing it against other voice conversion techniques in a low-resource setting using a small 9-minute training set.

CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

The applicability of CycleGAN-VC/VC2 to mel-spectrogram conversion was examined and it was discovered that their direct applications compromised the time-frequency structure that should be preserved during conversion.

Feature Quantization for Many-to-many Voice Conversion

A feature quantization model is proposed that plugs into the discriminator of StarGAN-VC2 and quantizes continuous features into a discrete embedding space to address the feature mapping problem and improve the quality of converted speech.

MaskCycleGAN-VC: Learning Non-Parallel Voice Conversion with Filling in Frames

MaskCycleGAN-VC is proposed, which is another extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF), which allows the converter to learn time-frequency structures in a self-supervised manner and eliminates the need for an additional module such as TFAN.



CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

CycleGAN-VC2 is proposed, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), improved generator (2-1-2D CNN), and improved discriminator (PatchGAN).

CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks

A non-parallel voice-conversion (VC) method is proposed that learns a mapping from source to target speech without relying on parallel data; it is general purpose, high quality, and works without any extra data, modules, or alignment procedure.

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

This work uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss to learn a mapping from source to target speech without relying on parallel data.

Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks

This paper proposes a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model.

ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion

A voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning is proposed, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech.

Non-native speech conversion with consistency-aware recursive network and generative adversarial network

Through subjective and quantitative evaluations, the superiority of the proposed method over a conventional NN approach in terms of conversion quality is confirmed.

ATTS2S-VC: Sequence-to-sequence Voice Conversion with Attention and Context Preservation Mechanisms

The proposed VC framework can be trained in only one day, using only one GPU of an NVIDIA Tesla K80, while the quality of the synthesized speech is higher than that of speech converted by Gaussian mixture model-based VC and is comparable to that of speech generated by recurrent neural network-based text-to-speech synthesis.

Generative Adversarial Network-Based Postfilter for STFT Spectrograms

The results show that the proposed generative adversarial network-based postfilter can be used to reduce the gap between synthesized and target spectra, even in the high-dimensional STFT domain.

The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

A brief summary of the state-of-the-art techniques for VC is presented, followed by a detailed explanation of the challenge tasks and the results that were obtained.

ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder

This paper confirmed experimentally that the proposed method outperformed baseline non-parallel VC systems and performed comparably to an open-source parallel VC system trained using a parallel corpus in a speaker identity conversion task.