Unsupervised Cross-Domain Singing Voice Conversion

@inproceedings{Polyak2020UnsupervisedCS,
  title={Unsupervised Cross-Domain Singing Voice Conversion},
  author={Adam Polyak and Lior Wolf and Yossi Adi and Yaniv Taigman},
  booktitle={INTERSPEECH},
  year={2020}
}
We present a wav-to-wav generative model for the task of singing voice conversion from any identity. Our method utilizes an acoustic model, trained for the task of automatic speech recognition, together with melody-extracted features to drive a waveform-based generator. The proposed generative architecture is invariant to the speaker's identity and can be trained to generate target singers from unlabeled training data, using either speech or singing sources. The model is optimized in an…
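To make the described pipeline concrete, here is a minimal PyTorch sketch of how speaker-invariant ASR-derived content features and an extracted pitch contour might jointly condition a waveform generator. All module and variable names are hypothetical; this illustrates only the conditioning scheme, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Stand-in for a frozen ASR acoustic model yielding speaker-invariant features."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.Conv1d(n_mels, dim, kernel_size=3, padding=1)

    def forward(self, mels):            # mels: (batch, n_mels, frames)
        return self.net(mels)           # (batch, dim, frames)

class WaveGenerator(nn.Module):
    """Toy waveform generator conditioned on content features and F0."""
    def __init__(self, dim=256, hop=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim + 1, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, hop, kernel_size=1),   # hop samples per frame
        )

    def forward(self, content, f0):     # f0: (batch, 1, frames), in Hz
        frames = self.net(torch.cat([content, f0], dim=1))          # (batch, hop, frames)
        return frames.transpose(1, 2).reshape(frames.size(0), -1)   # (batch, samples)

# Conversion step: the source audio supplies content and melody; the generator,
# trained only on the target singer, renders the waveform in the target voice.
mels = torch.randn(1, 80, 100)        # placeholder source mel frames
f0 = torch.rand(1, 1, 100) * 300.0    # placeholder extracted pitch contour
wav = WaveGenerator()(ContentEncoder()(mels), f0)
print(wav.shape)                      # torch.Size([1, 25600])
```

Because the content features come from an ASR model, they carry little speaker identity, which is what allows a generator trained only on unlabeled target-singer data to perform the conversion.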
Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding
TLDR
Experiments show that the proposed model significantly improves the naturalness of converted singing voices and their similarity to the target singer, and can also enable speakers with only speech data to sing.
Semi-Supervised Learning for Singing Synthesis Timbre
  • J. Bonada, M. Blaauw
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
We propose a semi-supervised singing synthesizer, which is able to learn new voices from audio data only, without any annotations such as phonetic segmentation. Our system is an encoder-decoder model…
PPG-Based Singing Voice Conversion with Adversarial Representation Learning
  • Zhonghao Li, Benlai Tang, +4 authors Zejun Ma
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
An end-to-end architecture is built that takes phonetic posteriorgrams (PPGs) as inputs and generates mel-spectrograms to supply acoustic and musical information; an adversarial singer-confusion module and a mel-regressive representation-learning module are designed for the model.
Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control
TLDR
Results show that the proposed approach can produce high-quality rapping/singing voices with increased naturalness.
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
TLDR
Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that the StarGANv2 model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech based voice conversion methods, without the need for text labels.
DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion
TLDR
DiffSVC, an SVC system based on a denoising diffusion probabilistic model that uses phonetic posteriorgrams (PPGs) as content features, achieves superior conversion performance in terms of naturalness and voice similarity compared to current state-of-the-art SVC approaches.
Controllable and Interpretable Singing Voice Decomposition via Assem-VC
TLDR
This work synthesizes singing voices from linguistic, melodic, and temporal information; conventional systems force users to control down to the smallest details, which bars the general public without musical expertise from expressing their creativity.
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
TLDR
Low-bitrate representations are extracted for speech content, prosodic information, and speaker identity, and these disentangled self-supervised discrete representations are used to resynthesize speech in a controllable manner.
fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit
TLDR
fairseq S^2 is presented, a fairseq extension for speech synthesis that implements a number of autoregressive and non-autoregressive text-to-speech models and their multi-speaker variants, together with a suite of automatic metrics.
On Generative Spoken Language Modeling from Raw Audio
TLDR
Generative Spoken Language Modeling is introduced: the task of learning the acoustic and linguistic characteristics of a language from raw audio, together with a set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels for both encoding and generation.

References

SHOWING 1-10 OF 56 REFERENCES
Unsupervised Singing Voice Conversion
TLDR
Evidence is presented that the conversion produces natural singing voices that are highly recognizable as the target singer, along with new training losses and protocols based on backtranslation.
WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN
TLDR
A deep neural network based singing voice synthesizer, inspired by the Deep Convolutional Generative Adversarial Networks (DCGAN) architecture and optimized using the Wasserstein-GAN algorithm, which facilitates the modelling of the large variability of pitch in the singing voice.
Singing Voice Conversion with Non-parallel Data
  • Xin Chen, Wei Chu, Jinxi Guo, N. Xu
  • Engineering, Computer Science
    2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)
  • 2019
TLDR
This paper proposes a parallel-data-free, many-to-one voice conversion technique for singing voices, using non-parallel data to train the singing voice conversion system.
Personalized Singing Voice Generation Using WaveRNN
TLDR
Experimental results suggest that the personalized SVG framework outperforms the traditional conversion-vocoder pipeline in the subjective and objective evaluations.
TTS Skins: Speaker Conversion via ASR
TLDR
This work trains a fully convolutional wav-to-wav network for converting between speakers' voices without relying on text, and demonstrates multi-voice TTS in those voices by converting the voice of a TTS robot.
Crepe: A Convolutional Representation for Pitch Estimation
TLDR
This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.
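Melody conditioning of the kind used in the surveyed paper needs exactly such a pitch tracker, and CREPE ships as an installable package. A minimal usage sketch follows, assuming a hypothetical input file named singer.wav.

```python
from scipy.io import wavfile
import crepe  # pip install crepe

sr, audio = wavfile.read('singer.wav')  # hypothetical input recording
# crepe.predict returns frame times (s), F0 estimates (Hz), voicing
# confidence in [0, 1], and the raw activation matrix; viterbi=True
# applies Viterbi smoothing to the pitch track.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
```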
Fitting New Speakers Based on a Short Untranscribed Sample
TLDR
This work presents a method designed to capture a new speaker from a short untranscribed audio sample by employing an additional network that, given an audio sample, places the speaker in the embedding space.
Pitchnet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network
TLDR
The proposed PitchNet adds an adversarially trained pitch regression network to force the encoder to learn pitch-invariant phoneme representations, and a separate module that feeds pitch extracted from the source audio to the decoder network.
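The pitch-adversarial idea can be sketched with a gradient-reversal layer: a regressor learns to predict F0 from the encoder output, while the reversed gradient pushes the encoder toward pitch-invariant features. The PyTorch snippet below is an illustrative sketch in which encoder and pitch_regressor are assumed modules; it is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def pitch_adversarial_loss(encoder, pitch_regressor, wav, f0):
    z = encoder(wav)                                  # phoneme representation
    f0_hat = pitch_regressor(GradReverse.apply(z))    # gradient reversed here
    # The regressor minimizes this loss; through the reversal, the encoder
    # is driven to discard the pitch information the regressor relies on.
    return F.mse_loss(f0_hat, f0)
```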
Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer
  • M. Blaauw, J. Bonada
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
This work proposes a sequence-to-sequence singing synthesizer, which avoids the need for training data with pre-aligned phonetic and acoustic features, and evaluates the effectiveness of this model compared to an autoregressive baseline, the importance of self-attention, and the accuracy of the duration model.
Applying voice conversion to concatenative singing-voice synthesis
TLDR
This work addresses the application of voice conversion to singing voice by applying the GMM-based approach to VOCALOID, a concatenative singing synthesizer, to perform singer timbre conversion, achieving a satisfactory conversion effect on the synthesized utterances.