Corpus ID: 245853923

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

Shoutong Wang, Jinglin Liu, Yi Ren, Zhen Wang, Changliang Xu, Zhou Zhao
Multi-speaker singing voice synthesis aims to generate singing voices in the voices of different speakers. To generalize to new speakers, previous zero-shot singing adaptation methods obtain the timbre of the target speaker as a fixed-size embedding extracted from a single reference audio. However, they face several challenges: 1) a fixed-size speaker embedding is not powerful enough to capture the full details of the target timbre; 2) a single reference audio does not contain sufficient timbre information of the…


U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity
U-Singer is proposed: the first multi-singer emotional singing voice synthesizer that expresses various levels of emotional intensity. It applies emotion-embedding interpolation and extrapolation techniques that lead the model to learn a linear embedding space, allowing it to express emotional-intensity levels not included in the training data.

DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System
This paper introduces a singing voice conversion algorithm capable of generating high-quality singing in the target speaker's voice using only his/her normal speech data; it unifies the features used in standard speech synthesis and singing synthesis systems.

PPG-Based Singing Voice Conversion with Adversarial Representation Learning
Zhonghao Li, Benlai Tang, Zejun Ma. ICASSP 2021.
An end-to-end architecture is built that takes phonetic posteriorgrams (PPGs) as input and generates mel spectrograms, supplying acoustic and musical information; an adversarial singer confusion module and a mel-regressive representation learning module are designed for the model.

Singing Voice Conversion with Non-parallel Data
This paper applies a parallel-data-free, many-to-one voice conversion technique to singing voices, training the conversion system on non-parallel data.

Pitchnet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network
PitchNet adds an adversarially trained pitch regression network that forces the encoder to learn pitch-invariant phoneme representations, plus a separate module that feeds pitch extracted from the source audio to the decoder network.

AdaSpeech: Adaptive Text to Speech for Custom Voice
AdaSpeech is an adaptive TTS system for high-quality and efficient customization to new voices. It achieves much better adaptation quality than baseline methods with only about 5K specific parameters per speaker, demonstrating its effectiveness for custom voice.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Unsupervised Singing Voice Conversion
Evidence is presented that the conversion produces natural singing voices that are highly recognizable as the target singer, along with new training losses and protocols based on backtranslation.

XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
XiaoiceSing is a high-quality singing voice synthesis system that employs an integrated network for spectrum, F0, and duration modeling; it follows the main architecture of FastSpeech while adding singing-specific designs, which demonstrate the overwhelming advantages of XiaoiceSing.

AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines
A large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker text-to-speech systems is presented, together with a robust synthesis model that achieves zero-shot voice cloning.

Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding
Attentron is proposed, a few-shot TTS model that clones voices of speakers unseen during training that significantly outperforms state-of-the-art models when generating speech for unseen speakers in terms of speaker similarity and quality.