Boosting Star-GANs for Voice Conversion with Contrastive Discriminator

@article{Si2022BoostingSF,
  title={Boosting Star-GANs for Voice Conversion with Contrastive Discriminator},
  author={Shijing Si and Jianzong Wang and Xulong Zhang and Xiaoyang Qu and Ning Cheng and Jing Xiao},
  journal={ArXiv},
  year={2022},
  volume={abs/2209.10088}
}
Nonparallel multi-domain voice conversion methods such as StarGAN-VC have been widely applied in many scenarios. However, training these models is usually challenging due to their complicated adversarial network architectures. To address this, in this work we leverage state-of-the-art contrastive learning techniques and incorporate an efficient Siamese network structure into the StarGAN discriminator. Our method, called SimSiam-StarGAN-VC, boosts the training…
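
The abstract is truncated, but the core architectural idea it describes can be sketched. Below is a minimal, hypothetical PyTorch sketch, not the paper's actual code: the class name, layer sizes, and backbone are assumptions. It shows a StarGAN-style discriminator whose feature extractor is shared with a SimSiam-style projector/predictor branch; the SimSiam objective itself is sketched under the "Exploring Simple Siamese Representation Learning" reference below.

import torch
import torch.nn as nn

class SimSiamStarGANDiscriminator(nn.Module):
    """Hypothetical sketch: a StarGAN-VC-style discriminator whose
    convolutional feature extractor also feeds a SimSiam-style
    projector/predictor branch. Sizes and names are illustrative."""
    def __init__(self, feat_dim=256, proj_dim=128):
        super().__init__()
        # Shared feature extractor over mel-spectrogram inputs (B, 1, F, T).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.adv_head = nn.Linear(feat_dim, 1)  # real/fake logit
        # SimSiam projector and predictor for the contrastive branch.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.BatchNorm1d(proj_dim),
            nn.ReLU(), nn.Linear(proj_dim, proj_dim),
        )
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, proj_dim // 2), nn.ReLU(),
            nn.Linear(proj_dim // 2, proj_dim),
        )

    def forward(self, x):
        h = self.backbone(x)
        # Adversarial logit plus projection/prediction for the Siamese loss.
        z = self.projector(h)
        return self.adv_head(h), z, self.predictor(z)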

References

Showing 1-10 of 37 references

StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

This work rethinks the conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and proposes an improved variant called StarGAN-VC2, which introduces a modulation-based conditional method and improves speech quality in terms of both global and local structure measures.
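
The "modulation-based conditional method" mentioned above can be illustrated with a conditional-normalization sketch. This is an assumption-laden illustration (the class name and all sizes are hypothetical), not StarGAN-VC2's exact formulation:

import torch
import torch.nn as nn

class ConditionalInstanceNorm(nn.Module):
    """Illustrative modulation-based conditioning in the spirit of
    StarGAN-VC2: a speaker embedding produces per-channel scale and
    shift applied after normalization. Names and sizes are assumptions."""
    def __init__(self, num_channels, emb_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(num_channels, affine=False)
        self.to_gamma = nn.Linear(emb_dim, num_channels)
        self.to_beta = nn.Linear(emb_dim, num_channels)

    def forward(self, x, spk_emb):
        # x: (batch, channels, time); spk_emb: (batch, emb_dim)
        gamma = self.to_gamma(spk_emb).unsqueeze(-1)  # (B, C, 1)
        beta = self.to_beta(spk_emb).unsqueeze(-1)
        return gamma * self.norm(x) + beta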

StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks

Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.

Towards Speaker Age Estimation With Label Distribution Learning

This work utilizes the ambiguous information among the age labels, converts each age label into a discrete label distribution, and leverages the label distribution learning (LDL) method to fit the data.
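
As an illustrative sketch of the LDL idea (the Gaussian construction, the 0-100 age range, and sigma are assumptions, not necessarily this paper's choices), a scalar age can be converted into a discrete label distribution and fitted with a KL objective:

import torch
import torch.nn.functional as F

def age_to_label_distribution(age, ages=torch.arange(0, 101), sigma=2.0):
    """Convert a scalar age label into a discrete Gaussian label
    distribution over candidate ages (a common LDL construction)."""
    logits = -((ages.float() - float(age)) ** 2) / (2 * sigma ** 2)
    return F.softmax(logits, dim=0)

def ldl_loss(pred_logits, target_dist):
    """Fit the predicted distribution to the target label distribution."""
    log_pred = F.log_softmax(pred_logits, dim=-1)
    return F.kl_div(log_pred, target_dist, reduction="batchmean")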

AVQVC: One-Shot Voice Conversion by Vector Quantization with Applying Contrastive Learning

A novel one-shot voice conversion framework, called AVQVC, is proposed based on vector quantization voice conversion (VQVC) and AutoVC, and a new training method is applied to VQVC to separate content and timbre information from speech more effectively.

TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

A novel voice conversion framework, named Text Guided AutoVC (TGAVC), is proposed to more effectively separate content and timbre from speech, where an expected content embedding produced from the text transcriptions is designed to guide the extraction of voice content.

Variational Information Bottleneck for Effective Low-resource Audio Classification

Evaluation on a few audio datasets shows that the VIB framework is ready to use, can be easily combined with many other state-of-the-art network architectures, and outperforms baseline methods.

Speech2Video: Cross-Modal Distillation for Speech to Video Generation

A lightweight cross-modal distillation method is proposed to extract disentangled emotional and identity information from unlabelled video inputs, and it outperforms existing algorithms in terms of emotion expression in the generated videos.

Training GANs with Stronger Augmentations via Contrastive Discriminator

This paper proposes a novel way to address the discriminator overfitting issue in GANs by incorporating a recent contrastive representation learning scheme into the GAN discriminator, coined ContraD, and shows that GANs with ContraD consistently improve FID and IS compared to other recent techniques incorporating data augmentations.
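
As a rough sketch of the kind of contrastive objective a ContraD-style discriminator is trained with (here a SimCLR-style NT-Xent loss over projected discriminator features of two augmented views; the temperature and setup are assumptions, not ContraD's exact recipe):

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss over two batches of projected
    features; each sample's positive is its other augmented view."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, D)
    sim = z @ z.t() / temperature                       # (2B, 2B)
    n = z1.size(0)
    # Mask self-similarity so each row's only positive is its pair.
    sim.fill_diagonal_(float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets.to(sim.device))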

Zero-Shot Singing Voice Conversion

In this paper, we propose the use of speaker embedding networks to perform zero-shot singing voice conversion, and suggest two architectures for its realization.

Exploring Simple Siamese Representation Learning

Xinlei Chen, Kaiming He. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

Surprising empirical results are reported that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders.
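
The stop-gradient trick at the heart of this result can be sketched in a few lines (a minimal illustration, assuming p1/p2 are predictor outputs and z1/z2 projector outputs of two augmented views):

import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized negative cosine similarity with stop-gradient, as in
    SimSiam; .detach() is the stop-gradient that removes the need for
    negative pairs, large batches, or momentum encoders."""
    return -(F.cosine_similarity(p1, z2.detach(), dim=1).mean()
             + F.cosine_similarity(p2, z1.detach(), dim=1).mean()) / 2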