Corpus ID: 238856863

Toward Degradation-Robust Voice Conversion

@article{Huang2021TowardDV,
  title={Toward Degradation-Robust Voice Conversion},
  author={Chien-Yu Huang and Kai-Wei Chang and Hung-yi Lee},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.07537}
}
Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any speaker, even one unseen during training. Although several state-of-the-art any-to-any voice conversion models exist, they all rely on clean utterances to convert successfully. In real-world scenarios, however, it is difficult to collect clean utterances of a speaker, and recordings are usually degraded by noise or reverberation. It is therefore highly desirable to understand how these degradations… 
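To make the kind of degradation discussed here concrete, below is a minimal NumPy sketch (not from the paper) of how a clean utterance might be corrupted for robustness experiments: reverberation is simulated by convolving with a room impulse response, and noise is mixed in at a chosen signal-to-noise ratio. The arrays `clean`, `rir`, and `noise` and the function name `degrade` are hypothetical placeholders.

    import numpy as np

    def degrade(clean, rir, noise, snr_db=10.0):
        # Corrupt a clean waveform with reverberation and additive noise.
        # `noise` is assumed to be at least as long as `clean`.
        # Reverberation: convolve with a (hypothetical) room impulse response.
        reverbed = np.convolve(clean, rir)[: len(clean)]
        # Scale the noise so the mixture reaches the requested SNR in dB.
        speech_power = np.mean(reverbed ** 2)
        noise_power = np.mean(noise[: len(reverbed)] ** 2) + 1e-12
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return reverbed + scale * noise[: len(reverbed)]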


References

Showing 1–10 of 30 references
Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments
TLDR: Experimental results show that Voicy outperforms other tested VC techniques in terms of naturalness and target speaker similarity in noisy reverberant environments, and is capable of performing non-parallel zero-shot VC.
How Far Are We from Robust Voice Conversion: A Survey
TLDR: It is found that the sampling rate and audio duration greatly influence voice conversion, and all the VC models suffer from unseen data, but AdaIN-VC is relatively more robust.
FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments with Attention
TLDR: Objective evaluation based on speaker verification and subjective evaluation with MOS both showed that this approach, which is accomplished end-to-end, outperformed SOTA approaches such as AdaIN-VC and AutoVC.
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
TLDR: This paper proposed a novel one-shot VC approach that performs VC with only one example utterance from the source and target speaker respectively, and the source and target speakers do not even need to be seen during training.
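As a rough illustration of the instance-normalization idea described in this entry, the sketch below (my own simplification, not the authors' code) normalizes per-channel statistics out of a source feature, treating those statistics as speaker information, and re-applies the target speaker's statistics. The feature shape (channels, time) and the helper names are assumptions.

    import numpy as np

    def instance_norm(feat, eps=1e-5):
        # Per-channel mean/std over time; these statistics carry speaker traits.
        mean = feat.mean(axis=1, keepdims=True)
        std = feat.std(axis=1, keepdims=True)
        return (feat - mean) / (std + eps), mean, std

    def adain_convert(source_feat, target_feat):
        # Strip the source speaker's statistics, then apply the target's.
        normalized, _, _ = instance_norm(source_feat)
        _, tgt_mean, tgt_std = instance_norm(target_feat)
        return normalized * tgt_std + tgt_mean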
Zero-Shot Voice Style Transfer with Only Autoencoder Loss
TLDR: A new style transfer scheme that involves only an autoencoder with a carefully designed bottleneck is proposed, which achieves state-of-the-art results in many-to-many voice conversion with non-parallel data and is the first to perform zero-shot voice conversion.
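A minimal PyTorch sketch of the bottleneck-autoencoder idea behind this entry (an illustration under assumed dimensions, not the actual published architecture): the encoder squeezes frames through a narrow bottleneck so speaker information is discarded, and the decoder reconstructs the frames conditioned on a separately supplied speaker embedding.

    import torch
    import torch.nn as nn

    class BottleneckVC(nn.Module):
        def __init__(self, n_mels=80, bottleneck=4, spk_dim=64, hidden=256):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_mels, hidden), nn.ReLU(),
                                         nn.Linear(hidden, bottleneck))
            self.decoder = nn.Sequential(nn.Linear(bottleneck + spk_dim, hidden),
                                         nn.ReLU(), nn.Linear(hidden, n_mels))

        def forward(self, mel, spk_emb):
            # mel: (batch, time, n_mels); spk_emb: (batch, spk_dim)
            content = self.encoder(mel)                             # narrow bottleneck
            spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)  # broadcast over time
            return self.decoder(torch.cat([content, spk], dim=-1))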
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
TLDR: An adversarial learning framework for voice conversion is proposed, with which a single model can be trained to convert the voice to many different speakers, all without parallel data, by separating the speaker characteristics from the linguistic content in speech signals.
Towards Achieving Robust Universal Neural Vocoding
TLDR: A WaveRNN-based vocoder is shown to be capable of generating speech of consistently good quality regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario, when the recording conditions are studio-quality.
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
TLDR: Two different approaches to speech enhancement for training TTS systems are investigated, following conventional speech enhancement methods; results show that the second approach yields larger MCEP distortion but smaller F0 errors.
StarGAN-VC: Non-Parallel Many-to-Many Voice Conversion Using Star Generative Adversarial Networks
TLDR: Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
Yi Luo, N. Mesgarani · IEEE/ACM Transactions on Audio, Speech, and Language Processing · 2019
TLDR: A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
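The masking idea can be summarized with the toy PyTorch module below (a simplification with hypothetical sizes, not the published Conv-TasNet architecture): a learned 1-D convolutional encoder replaces the STFT, per-source masks are estimated and applied in that latent domain, and a transposed convolution decodes each masked representation back to a waveform.

    import torch
    import torch.nn as nn

    class MaskingSeparator(nn.Module):
        def __init__(self, n_filters=256, kernel=16, n_sources=2):
            super().__init__()
            stride = kernel // 2
            self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
            self.mask_net = nn.Conv1d(n_filters, n_filters * n_sources, 1)
            self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)
            self.n_sources = n_sources

        def forward(self, mix):
            # mix: (batch, 1, samples)
            latent = torch.relu(self.encoder(mix))        # (batch, filters, frames)
            masks = torch.sigmoid(self.mask_net(latent))  # one mask per source
            masks = masks.view(mix.size(0), self.n_sources, -1, latent.size(-1))
            # Mask in the latent domain, then decode each source back to a waveform.
            return torch.stack([self.decoder(latent * masks[:, s])
                                for s in range(self.n_sources)], dim=1)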