Unsupervised Cross-Domain Singing Voice Conversion

Adam Polyak, Lior Wolf, Yossi Adi, Yaniv Taigman
We present a wav-to-wav generative model for singing voice conversion from any identity. Our method uses an acoustic model, trained for automatic speech recognition, together with melody-extracted features to drive a waveform-based generator. The proposed generative architecture is invariant to the speaker's identity and can be trained to generate target singers from unlabeled training data, using either speech or singing sources. The model is optimized in an…
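The conditioning idea in the abstract (speaker-independent content features plus melody features driving a generator) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the nearest-neighbour alignment, the per-frame concatenation, and the explicit target-singer embedding are all assumptions made for illustration, and the feature values are toy numbers.

```python
from typing import List, Sequence


def resample_nearest(track: Sequence[float], n_frames: int) -> List[float]:
    """Stretch or shrink a 1-D track to n_frames by nearest-neighbour lookup."""
    if not track:
        raise ValueError("track must be non-empty")
    last = len(track) - 1
    return [track[min(last, i * len(track) // n_frames)] for i in range(n_frames)]


def build_conditioning(
    content: Sequence[Sequence[float]],   # hypothetical per-frame ASR features
    f0_hz: Sequence[float],               # hypothetical pitch track, own frame rate
    singer_embedding: Sequence[float],    # hypothetical target-singer vector
) -> List[List[float]]:
    """Concatenate content, aligned pitch, and target identity for each frame."""
    f0_aligned = resample_nearest(f0_hz, len(content))
    return [
        list(frame) + [f0] + list(singer_embedding)
        for frame, f0 in zip(content, f0_aligned)
    ]


# Toy example: 4 content frames, a 2-frame pitch track, a 2-dim singer vector.
content = [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2], [0.5, 0.5]]
f0_hz = [220.0, 440.0]
singer = [1.0, 0.0]
cond = build_conditioning(content, f0_hz, singer)
```

In this sketch the source singer's identity never enters the conditioning directly: only the content and pitch tracks come from the input audio, while the identity vector belongs to the target. A real system would feed such per-frame vectors to a neural waveform generator, which is outside the scope of this example.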


Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding

Experiments show that, compared with the baseline models, the proposed model can improve the naturalness of converted singing voices and the similarity with the target singer and can also make the speakers with just speech data sing.

Semi-Supervised Learning for Singing Synthesis Timbre

  • J. Bonada, M. Blaauw
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
We propose a semi-supervised singing synthesizer, which is able to learn new voices from audio data only, without any annotations such as phonetic segmentation. Our system is an encoder-decoder model

PPG-Based Singing Voice Conversion with Adversarial Representation Learning

  • Zhonghao Li, Benlai Tang, Zejun Ma
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
An end-to-end architecture is built that takes phonetic posteriorgrams (PPGs) as inputs and generates mel spectrograms to supply acoustic and musical information, and an adversarial singer confusion module and a mel-regressive representation learning module are designed for the model.

Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Results show that the proposed approach can produce high quality rapping/singing voice with increased naturalness.

Robust One-Shot Singing Voice Conversion

A robust one-shot SVC (ROSVC) method is proposed that performs any-to-any SVC robustly, even on distorted singing voices, using less than 10 s of a reference voice.

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that the StarGAN v2 model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech based voice conversion methods, without the need for text labels.

DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

DiffSVC, an SVC system based on a denoising diffusion probabilistic model that uses phonetic posteriorgrams (PPGs) as content features, can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.

Controllable and Interpretable Singing Voice Decomposition via Assem-VC

This work synthesizes singing voices from linguistic, melodic, and temporal information, forcing users to control down to the smallest details, which bars the general public without musical expertise from expressing their creativity.

UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voice without corresponding training data, which proves the effectiveness and advantages of UniSyn.

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

This work proposes a novel hierarchical speaker representation framework for SVC, which can capture coarse-grained speaker characteristics at different granularity and outperforms both the LUT and SRN based SVC systems.



Unsupervised Singing Voice Conversion

Evidence that the conversion produces natural singing voices that are highly recognizable as the target singer is presented, as well as new training losses and protocols that are based on backtranslation.

WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN

A deep neural network based singing voice synthesizer is presented, inspired by the Deep Convolutional Generative Adversarial Networks (DCGAN) architecture and optimized using the Wasserstein-GAN algorithm, which facilitates the modelling of the large variability of pitch in the singing voice.

Singing Voice Conversion with Non-parallel Data

This paper proposes applying a parallel-data-free, many-to-one voice conversion technique to singing voices, so that a singing voice conversion system can be trained on non-parallel data.

Personalized Singing Voice Generation Using WaveRNN

Experimental results on the NUS-48E and NUS-HLT-SLS corpora suggest that the personalized SVG framework outperforms the traditional conversion-vocoder pipeline in the subjective and objective evaluations.

TTS Skins: Speaker Conversion via ASR

This work trains a fully convolutional wav-to-wav network for converting between speakers' voices, without relying on text, and demonstrates multi-voice TTS in those voices, by converting the voice of a TTS robot.

Crepe: A Convolutional Representation for Pitch Estimation

This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.

Fitting New Speakers Based on a Short Untranscribed Sample

This work presents a method designed to capture a new speaker from a short untranscribed audio sample by employing an additional network that, given an audio sample, places the speaker in the embedding space.

Pitchnet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network

The proposed PitchNet adds an adversarially trained pitch regression network to force the encoder network to learn a pitch-invariant phoneme representation, and a separate module to feed pitch extracted from the source audio to the decoder network.

Applying voice conversion to concatenative singing-voice synthesis

This work addresses the application of voice conversion to the singing voice by applying the GMM-based approach to VOCALOID, a concatenative singing synthesizer, to perform singer timbre conversion, achieving a satisfactory conversion effect on the synthesized utterances.

A Neural Parametric Singing Synthesizer

A new model for singing synthesis based on a modified version of the WaveNet architecture is presented, which allows conveniently modifying pitch to match any target melody, facilitates training on more modest dataset sizes, and significantly reduces training and generation times.