Noise-robust voice conversion with domain adversarial training

@article{Du2022NoiserobustVC,
  title={Noise-robust voice conversion with domain adversarial training},
  author={Hongqiang Du and Lei Xie and Haizhou Li},
  journal={Neural networks : the official journal of the International Neural Network Society},
  year={2022},
  volume={148},
  pages={74--84}
}
  • Hongqiang Du, Lei Xie, Haizhou Li
  • Published 1 January 2022
  • Computer Science
  • Neural networks : the official journal of the International Neural Network Society

Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers

A noise-independent speech representation learning approach for high-quality voice conversion with noisy target speakers, built on a latent feature space that ensures the target distribution modeled by the conversion model lies exactly within the distribution modeled by the waveform generator.

Preserving background sound in noise-robust voice conversion via multi-task learning

Experimental results demonstrate that the proposed end-to-end multi-task learning framework outperforms the baseline systems while achieving quality and speaker similarity comparable to VC models trained on clean data.

References

Showing 1-10 of 59 references

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

An adversarial learning framework for voice conversion is proposed, with which a single model can be trained to convert the voice to many different speakers, all without parallel data, by separating the speaker characteristics from the linguistic content in speech signals.
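
The adversarial separation this summary describes can be sketched compactly. Below is a minimal PyTorch sketch (not the paper's code) of the alternating scheme: a speaker classifier is trained to recover speaker identity from the content encoder's output, while the encoder is trained to fool it. All module names, dimensions, and the negative-cross-entropy adversary are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions; the paper's actual architecture differs.
content_encoder = nn.GRU(80, 128, batch_first=True)  # mel frames -> content codes
speaker_clf = nn.Linear(128, 10)                     # adversary: predict speaker id

enc_opt = torch.optim.Adam(content_encoder.parameters(), lr=1e-4)
clf_opt = torch.optim.Adam(speaker_clf.parameters(), lr=1e-4)

def train_step(mels, speaker_ids, adv_weight=0.1):
    # 1) Train the classifier to recover speaker identity from content codes.
    codes, _ = content_encoder(mels)
    logits = speaker_clf(codes.mean(dim=1))          # utterance-level pooling
    clf_loss = F.cross_entropy(logits, speaker_ids)
    clf_opt.zero_grad()
    clf_loss.backward()
    clf_opt.step()

    # 2) Train the encoder to fool the classifier, pushing speaker
    #    information out of the content representation. Maximizing the
    #    classifier's loss is one common simplification of this step.
    enc_opt.zero_grad()
    codes, _ = content_encoder(mels)
    logits = speaker_clf(codes.mean(dim=1))
    adv_loss = -adv_weight * F.cross_entropy(logits, speaker_ids)
    adv_loss.backward()
    enc_opt.step()
```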

Adversarial Feature Learning and Unsupervised Clustering Based Speech Synthesis for Found Data With Acoustic and Textual Noise

An approach to building a high-quality and stable seq2seq-based speech synthesis system from challenging found data is proposed, along with a VQ-VAE-based heuristic method that compensates for erroneous linguistic features using phonetic information learned directly from speech.
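
The summary names a VQ-VAE component without detail; the sketch below shows the generic vector-quantization layer with a straight-through estimator that such methods build on. Codebook size, dimensions, and the commitment weight are assumptions, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Generic VQ layer: snap each frame vector to its nearest codebook entry."""
    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))
        d = torch.cdist(flat, self.codebook.weight)      # pairwise distances
        idx = d.argmin(dim=1)
        q = self.codebook(idx).view_as(z)
        # Codebook and commitment losses (the standard VQ-VAE objective).
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        # Straight-through estimator: copy gradients from q back to z.
        q = z + (q - z).detach()
        return q, loss, idx.view(z.shape[:-1])
```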

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

This paper proposes a deep discriminative speaker encoder that improves the robustness of one-shot voice conversion for unseen speakers, and the resulting system outperforms baseline systems in terms of speech quality and speaker similarity.

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

This paper proposes a novel one-shot VC approach that performs conversion given only one example utterance each from the source and target speakers, neither of whom needs to be seen during training.
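
The core trick the title names, instance normalization, can be illustrated in a few lines. This is a hedged sketch of the general mechanism, not the paper's implementation: per-utterance channel statistics (which carry speaker identity) are stripped from the content path, then the target speaker's statistics are re-injected. In the actual model the target statistics come from a learned speaker encoder.

```python
import torch

def instance_norm(x, eps=1e-5):
    # x: (batch, channels, time). Normalize each channel per utterance,
    # removing global (speaker-dependent) statistics from the content path.
    mu = x.mean(dim=2, keepdim=True)
    sigma = x.std(dim=2, keepdim=True)
    return (x - mu) / (sigma + eps)

def adain(content, spk_mu, spk_sigma):
    # Re-inject the *target* speaker's statistics into normalized content;
    # swapping in a new speaker's statistics is what performs the conversion.
    return spk_sigma * instance_norm(content) + spk_mu
```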

Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization

Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers.

Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition

Experiments demonstrate that the proposed domain adversarial training method is not only effective in solving the dataset mismatch problem, but also outperforms the compared unsupervised domain adaptation methods.
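
Domain adversarial training, also the technique in the main paper's title, typically rests on a gradient reversal layer. The following is the standard generic construction in PyTorch, not this paper's exact code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage in a DANN-style model (names are illustrative):
#   feats = feature_extractor(x)
#   domain_logits = domain_clf(grad_reverse(feats, lam))
#   loss = task_loss + F.cross_entropy(domain_logits, domain_labels)
# The domain classifier learns to predict the domain, while the reversed
# gradients push the feature extractor toward domain-invariant embeddings.
```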

Variational Domain Adversarial Learning for Speaker Verification

Experiments on both SRE16 and SRE18-CMN2 show that VDANN outperforms the Kaldi baseline and the standard DANN, and results suggest that VAE regularization is effective for domain adaptation.
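
The VAE regularization credited here amounts to adding a KL-divergence term that pulls the embedding posterior toward a standard Gaussian. A minimal sketch of that term, assuming a diagonal-Gaussian encoder that outputs mu and logvar:

```python
import torch

def gaussian_kl(mu, logvar):
    # KL( N(mu, exp(logvar)) || N(0, I) ), summed over dimensions and
    # averaged over the batch: the VAE term added on top of the usual
    # domain adversarial objective to keep embeddings Gaussian-distributed.
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()
```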

Zero-Shot Voice Style Transfer with Only Autoencoder Loss

A new style transfer scheme that involves only an autoencoder with a carefully designed bottleneck is proposed, which achieves state-of-the-art results in many-to-many voice conversion with non-parallel data and is the first to perform zero-shot voice conversion.
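
The "carefully designed bottleneck" idea (this is the AutoVC line of work) can be sketched as follows. Dimensions and layer choices are illustrative assumptions; the real model uses recurrent and convolutional encoders rather than these toy MLPs.

```python
import torch
import torch.nn as nn

class BottleneckVC(nn.Module):
    # Key idea: a bottleneck narrow enough that only linguistic content fits
    # through, so speaker identity must be re-supplied by the conditioning
    # embedding. Trained with reconstruction loss alone.
    def __init__(self, mel_dim=80, bottleneck=8, spk_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(mel_dim, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck + spk_dim, 256),
                                     nn.ReLU(), nn.Linear(256, mel_dim))

    def forward(self, mel, spk_emb):
        code = self.encoder(mel)                         # (B, T, bottleneck)
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.decoder(torch.cat([code, spk], dim=-1))

# Training: self-reconstruct with the source speaker's embedding.
# Inference: swap in the target speaker's embedding to convert.
```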

Optimizing Voice Conversion Network with Cycle Consistency Loss of Speaker Identity

A novel training scheme that optimizes a voice conversion network with a speaker identity loss function, reducing frame-level spectral loss and introducing a cycle consistency loss that constrains the converted speech to maintain the same speaker identity as the reference speech at the utterance level.
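
A plausible shape for the combined objective described here, as a sketch only: a frame-level spectral loss plus an utterance-level speaker-identity term computed with a hypothetical pretrained speaker encoder. The paper's exact distance and weighting may differ.

```python
import torch
import torch.nn.functional as F

def total_loss(converted, target, spk_encoder, reference, alpha=1.0):
    # Frame-level spectral loss between converted and target mel-spectrograms.
    spectral = F.l1_loss(converted, target)
    # Utterance-level speaker-identity consistency: the converted speech
    # should embed to the same point as the reference speaker's speech.
    # spk_encoder is an assumed pretrained speaker-embedding model.
    e_conv = spk_encoder(converted)                  # (batch, emb_dim)
    e_ref = spk_encoder(reference)
    identity = 1.0 - F.cosine_similarity(e_conv, e_ref, dim=-1).mean()
    return spectral + alpha * identity
```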

One-Shot Voice Conversion For Style Transfer Based On Speaker Adaptation

  • Zhichao Wang, Qicong Xie, Mengxiao Bi
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
This paper proposes a one-shot voice conversion approach for style transfer based on speaker adaptation, adopting weight regularization during adaptation to prevent over-fitting caused by using only one utterance from the target speaker as training data.
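
The weight regularization described here can be sketched as an L2 penalty that ties adapted weights to their pretrained values (L2-SP style); the paper's exact regularizer may differ. Here `pretrained_state` is assumed to be captured before adaptation, e.g. `{n: p.detach().clone() for n, p in model.named_parameters()}`.

```python
import torch

def adaptation_loss(model, pretrained_state, task_loss, reg_weight=1e-3):
    # Penalize drift from the pretrained weights so that fine-tuning on a
    # single target utterance does not overfit. An L2-SP-style sketch, not
    # necessarily the paper's exact scheme.
    reg = sum(((p - pretrained_state[name].to(p.device)) ** 2).sum()
              for name, p in model.named_parameters())
    return task_loss + reg_weight * reg
```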
...