Data Efficient Voice Cloning for Neural Singing Synthesis

@article{Blaauw2019DataEV,
  title={Data Efficient Voice Cloning for Neural Singing Synthesis},
  author={Merlijn Blaauw and Jordi Bonada and Ryunosuke Daido},
  journal={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  pages={6840-6844}
}
  • M. Blaauw, J. Bonada, R. Daido
  • Published 19 February 2019
  • Computer Science, Engineering
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
There are many use cases in singing synthesis where creating voices from small amounts of data is desirable. In text-to-speech there have been several promising results that apply voice cloning techniques to modern deep learning based models. In this work, we adapt one such technique to the case of singing synthesis. By leveraging data from many speakers to first create a multispeaker model, small amounts of target data can then efficiently adapt the model to new unseen voices. We evaluate the…
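The core recipe the abstract describes — train a multi-speaker model with per-singer embeddings, then adapt it to an unseen voice from a small amount of target data — can be sketched in a few lines. The following is a minimal, hypothetical PyTorch illustration; the module names, layer sizes, and L1 objective are illustrative assumptions, not the authors' actual architecture:

```python
import torch
import torch.nn as nn

class MultiSpeakerSynth(nn.Module):
    """Toy multi-speaker acoustic model: control features -> vocoder features.

    Hypothetical stand-in; the simple feed-forward stack is illustrative,
    not the paper's model.
    """
    def __init__(self, n_speakers, feat_dim=64, spk_dim=16, out_dim=60):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)  # one vector per training singer
        self.net = nn.Sequential(
            nn.Linear(feat_dim + spk_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, feats, spk_id):                      # feats: (B, T, feat_dim)
        spk = self.spk_emb(spk_id)[:, None, :].expand(-1, feats.size(1), -1)
        return self.net(torch.cat([feats, spk], dim=-1))

def adapt_to_new_voice(model, feats, targets, steps=200, lr=1e-3,
                       embedding_only=True):
    """Adapt a trained multi-speaker model to one unseen voice.

    With very little target data, fine-tune only a fresh speaker embedding;
    with a bit more, unfreeze the shared network too (embedding_only=False).
    """
    # Initialize the new voice at the mean of the known speaker embeddings.
    new_emb = nn.Parameter(model.spk_emb.weight.detach().mean(0, keepdim=True))
    params = [new_emb] if embedding_only else [new_emb, *model.net.parameters()]
    opt = torch.optim.Adam(params, lr=lr)  # only `params` are updated
    for _ in range(steps):
        spk = new_emb[:, None, :].expand(feats.size(0), feats.size(1), -1)
        pred = model.net(torch.cat([feats, spk], dim=-1))
        loss = nn.functional.l1_loss(pred, targets)        # illustrative objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_emb.detach()
```

Fine-tuning only the new speaker embedding is the cheapest option; with slightly more target data, also unfreezing the shared network typically improves similarity at the cost of more compute.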

Citations

Zero-Shot Singing Voice Conversion
In this paper, we propose the use of speaker embedding networks to perform zero-shot singing voice conversion, and suggest two architectures for its realization. The use of speaker embedding networks…
Learn2Sing: Target Speaker Singing Voice Synthesis by Learning from a Singing Teacher
The proposed approach is capable of synthesizing singing voice for a target speaker given only their speech samples. It employs domain adversarial training (DAT) in the acoustic model, which aims to enhance the singing performance of target speakers by disentangling style from the acoustic features of singing and speaking data.
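Domain adversarial training of the kind this summary describes is typically implemented with a gradient reversal layer: an auxiliary classifier is trained to predict the unwanted factor (here, speaking vs. singing style) from intermediate features, while reversed gradients push the upstream network to discard that information. A minimal sketch of the generic DAT mechanics, not Learn2Sing's actual code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Reversed gradient flows into the feature extractor; no grad for lam.
        return -ctx.lam * grad_out, None

def adversarial_branch(features, classifier, lam=1.0):
    """Classify style from reversed features: the classifier learns the factor,
    while the upstream network is pushed to unlearn it."""
    return classifier(GradReverse.apply(features, lam))
```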
Unsupervised Cross-Domain Singing Voice Conversion
The proposed approach is fully convolutional and can generate audio in real time; it significantly outperforms the baseline methods while generating convincingly better audio samples than alternative attempts.
Speech Synthesis as Augmentation for Low-Resource ASR
This paper investigates the possibility of using synthesized speech as a form of data augmentation to lower the resources necessary to build a speech recognizer.
WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN
A deep neural network based singing voice synthesizer, inspired by the Deep Convolutional Generative Adversarial Networks (DCGAN) architecture and optimized using the Wasserstein-GAN algorithm, which facilitates the modelling of the large variability of pitch in the singing voice.
Unsupervised Singing Voice Conversion
Evidence is presented that the conversion produces natural singing voices that are highly recognizable as the target singer, along with new training losses and protocols based on backtranslation.
Exploring Cross-lingual Singing Voice Synthesis Using Speech Data
Objective evaluation and subjective listening tests demonstrate that the proposed cross-lingual SVS system can generate singing voice with decent naturalness and fair speaker similarity; adding singing data or multi-speaker monolingual speech data further improves generalization in pronunciation and pitch accuracy.
Adversarially Trained Multi-Singer Sequence-To-Sequence Singing Synthesizer
Both objective and subjective evaluations indicate that the proposed synthesizer can generate higher quality singing voice than the baseline, and that the articulation of high-pitched vowels is significantly enhanced.
DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System
This paper introduces a singing voice conversion algorithm capable of generating high-quality singing in the target speaker's voice using only his/her normal speech data, unifying the features used in standard speech synthesis and singing synthesis systems.
Learning Singing From Speech
The proposed algorithm learns universal speaker embeddings that are shareable between speech and singing synthesis tasks, and generates high-quality singing voices that sound highly similar to the target speaker's voice given only his or her normal speech samples.

References

Showing 1–10 of 21 references
Fitting New Speakers Based on a Short Untranscribed Sample
This work presents a method designed to capture a new speaker from a short untranscribed audio sample by employing an additional network that, given an audio sample, places the speaker in the embedding space.
Neural Voice Cloning with a Few Samples
While speaker adaptation can achieve better naturalness and similarity, the cloning time and required memory for the speaker encoding approach are significantly lower, making it favorable for low-resource deployment.
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
A new neural text-to-speech method is presented that can transform text to speech in voices sampled in the wild, without requiring aligned phonemes or linguistic features, making TTS accessible to a wider range of applications.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
Integration of speaker and pitch adaptive training for HMM-based singing voice synthesis
This paper proposes “singer adaptive training”, which can solve the data sparseness problem; experimental results demonstrated that the proposed technique improved the quality of the synthesized singing voices.
A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs
We recently presented a new model for singing synthesis based on a modified version of the WaveNet architecture. Instead of modeling raw waveform, we model features produced by a parametric vocoder…
Sample Efficient Adaptive Text-to-Speech
Three strategies for adapting the multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voices of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
Char2Wav: End-to-End Speech Synthesis
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.