RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses
@inproceedings{Xu2021RefineGANUG,
  title={RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses},
  author={Shengyuan Xu and Wenxiao Zhao and Jing Guo},
  booktitle={Interspeech},
  year={2021}
}
Most GAN (Generative Adversarial Network)-based approaches to high-fidelity waveform generation rely heavily on discriminators to improve their performance. However, GAN methods introduce considerable uncertainty into the generation process and often result in mismatches of pitch and intensity, which is fatal in sensitive use cases such as singing voice synthesis (SVS). To address this problem, we propose RefineGAN, a high-fidelity neural vocoder focused on the robustness, pitch and…
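To make the pitch and intensity mismatch concrete, the sketch below compares frame-level F0 (via librosa's pYIN tracker) and RMS intensity between a generated waveform and its reference. This is purely illustrative and not RefineGAN's evaluation protocol; the hop length, pitch range, and error metrics are assumptions.

```python
# Illustrative only -- not RefineGAN's evaluation protocol. Compares frame-level
# F0 and RMS intensity between a generated waveform and its reference.
import numpy as np
import librosa

def pitch_intensity_error(y_gen, y_ref, sr, hop=256):
    kwargs = dict(fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
                  sr=sr, hop_length=hop)
    f0_gen, _, _ = librosa.pyin(y_gen, **kwargs)
    f0_ref, _, _ = librosa.pyin(y_ref, **kwargs)
    n = min(len(f0_gen), len(f0_ref))
    f0_gen, f0_ref = f0_gen[:n], f0_ref[:n]
    voiced = ~np.isnan(f0_gen) & ~np.isnan(f0_ref)        # compare voiced frames only
    f0_rmse = float(np.sqrt(np.mean((f0_gen[voiced] - f0_ref[voiced]) ** 2)))
    rms_gen = librosa.feature.rms(y=y_gen, hop_length=hop)[0]
    rms_ref = librosa.feature.rms(y=y_ref, hop_length=hop)[0]
    m = min(len(rms_gen), len(rms_ref))
    rms_mae = float(np.mean(np.abs(rms_gen[:m] - rms_ref[:m])))
    return f0_rmse, rms_mae
```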
5 Citations
HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation
- Computer Science, ArXiv
- 2022
Experimental results show that the proposed HiFi-WaveGAN outperforms other neural vocoders such as Parallel WaveGAN (PWG) and HiFi-GAN in the mean opinion score (MOS) metric on the 48 kHz SVS task.
DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP
- Computer Science, ArXiv
- 2022
DSPGAN is a GAN-based universal vocoder for high-fidelity TTS that applies time-frequency domain supervision from digital signal processing (DSP) to eliminate the mismatch problem caused by ground-truth spectrograms in the training phase.
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher
- Computer Science, INTERSPEECH
- 2022
Experiments show that the proposed Learn2Sing 2.0 is capable of synthesizing high-quality singing voice for a target speaker without singing data from that speaker, using only 10 decoding steps.
VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer
- Computer Science, ArXiv
- 2022
Experimental results show that VISinger 2, which incorporates a DSP synthesizer into the decoder, substantially outperforms CpopSing, VISinger, and RefineSinger in both subjective and objective metrics.
Improve GAN-based Neural Vocoder using Truncated Pointwise Relativistic Least Square GAN
- Computer Science, Proceedings of the 4th International Conference on Advanced Information Science and System
- 2022
This paper proposes a simple yet effective variant of the LSGAN framework, named Truncated Pointwise Relativistic LSGAN (T-PRLSGAN). It considers the pointwise truism score distribution of real and fake wave segments and combines the mean squared error (MSE) loss with the proposed truncated pointwise relative discrepancy loss to make it harder for the generator to fool the discriminator, leading to improved audio generation quality and stability.
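Based only on the one-sentence summary above, here is a minimal sketch of what a truncated pointwise relativistic least-squares generator loss combined with an MSE term could look like. The function name, the truncation threshold `tau`, the one-sided form of the truncation, and the weight `lambda_mse` are all assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a truncated pointwise relativistic LSGAN generator loss
# plus an MSE reconstruction term; tau and lambda_mse are assumed values.
import torch
import torch.nn.functional as F

def generator_loss(d_real, d_fake, y_real, y_fake, tau=1.0, lambda_mse=45.0):
    """d_real, d_fake: pointwise discriminator scores for paired real/generated
    wave segments; y_real, y_fake: the waveform segments themselves."""
    # Pointwise relative discrepancy: how much more "real" each real point looks
    # than its generated counterpart.
    diff = d_real - d_fake
    # Truncation (assumed form): once a generated point already beats its real
    # counterpart by more than tau, it stops contributing to the loss.
    diff = torch.clamp(diff, min=-tau)
    # Relativistic least-squares term: push each generated point to look at
    # least as real as its paired real point (i.e. drive diff toward -tau).
    adv = torch.mean((diff + tau) ** 2)
    # Waveform-level MSE term, as mentioned in the summary above.
    return adv + lambda_mse * F.mse_loss(y_fake, y_real)
```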
References
Showing 1-10 of 40 references
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
- Computer Science, NeurIPS
- 2020
It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and HiFi-GAN is shown to generalize to mel-spectrogram inversion for unseen speakers and to end-to-end speech synthesis.
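The "periodic patterns" referred to above are captured by HiFi-GAN's multi-period discriminator. Below is a minimal sketch of its central reshaping step, with the per-period 2-D convolution stacks omitted and the rest simplified.

```python
# Minimal sketch of HiFi-GAN's period-wise reshaping: a 1-D waveform is folded
# into a 2-D map of shape (T/p, p) so 2-D convolutions see samples that lie
# exactly p steps apart.
import torch
import torch.nn.functional as F

def fold_by_period(wav: torch.Tensor, period: int) -> torch.Tensor:
    """wav: (batch, 1, T) -> (batch, 1, T // period, period)."""
    b, c, t = wav.shape
    if t % period != 0:                                   # pad so T divides evenly
        wav = F.pad(wav, (0, period - t % period), mode="reflect")
        t = wav.shape[-1]
    return wav.view(b, c, t // period, period)

# Each prime period gets its own 2-D convolutional sub-discriminator (omitted);
# the folded views expose periodicity that 1-D convolutions alone tend to miss.
folded = [fold_by_period(torch.randn(1, 1, 8192), p) for p in (2, 3, 5, 7, 11)]
```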
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
- Computer Science, NeurIPS
- 2019
The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
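As one concrete example of such a discriminator design, here is a simplified sketch of the multi-scale pattern associated with MelGAN: identical sub-discriminators score the waveform at progressively average-pooled rates. The layer sizes are toy placeholders, not the paper's architecture.

```python
# Simplified multi-scale discriminator: each sub-discriminator sees the audio
# at half the time resolution of the previous one.
import torch
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, n_scales: int = 3):
        super().__init__()
        self.pool = nn.AvgPool1d(4, stride=2, padding=1)   # halves the time resolution
        self.discs = nn.ModuleList([
            nn.Sequential(                                  # toy sub-discriminator
                nn.Conv1d(1, 16, 15, padding=7), nn.LeakyReLU(0.2),
                nn.Conv1d(16, 1, 3, padding=1),
            )
            for _ in range(n_scales)
        ])

    def forward(self, wav):                                 # wav: (batch, 1, samples)
        scores = []
        for disc in self.discs:
            scores.append(disc(wav))
            wav = self.pool(wav)                            # next scale is downsampled
        return scores
```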
HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis
- Computer Science, ArXiv
- 2020
This paper introduces multi-scale adversarial training in both the acoustic model and the vocoder to improve singing modeling, and proposes a novel sub-frequency GAN for mel-spectrogram generation, which splits the full 80-dimensional mel-frequency range into multiple sub-bands and models each sub-band with a separate discriminator.
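A hypothetical sketch of the sub-frequency idea: the 80 mel bins are split into sub-bands and each band is scored by its own discriminator. The band edges, the overlap, and the per-band networks below are assumptions made for illustration.

```python
# Hypothetical sub-frequency discriminators over an 80-bin mel-spectrogram;
# band edges and layer sizes are assumed, not taken from the paper.
import torch
import torch.nn as nn

BANDS = [(0, 30), (25, 55), (50, 80)]          # assumed, slightly overlapping sub-bands

class SubFrequencyDiscriminators(nn.Module):
    def __init__(self, bands=BANDS):
        super().__init__()
        self.bands = bands
        self.discs = nn.ModuleList([
            nn.Sequential(                      # toy per-band discriminator
                nn.Conv1d(hi - lo, 64, 3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv1d(64, 1, 3, padding=1),
            )
            for lo, hi in bands
        ])

    def forward(self, mel):                     # mel: (batch, 80, frames)
        return [disc(mel[:, lo:hi]) for (lo, hi), disc in zip(self.bands, self.discs)]
```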
Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real time in a single-GPU environment, which is comparable to the best distillation-based Parallel WaveNet system.
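A minimal sketch of the multi-resolution STFT auxiliary loss that Parallel WaveGAN pairs with adversarial training: spectral convergence plus log-magnitude L1, averaged over several STFT parameter sets. The three (n_fft, hop, win) triples are commonly used defaults, and the equal weighting of the terms is an assumption.

```python
# Multi-resolution STFT loss sketch: spectral convergence + log-magnitude L1
# over several STFT resolutions; parameter sets are commonly used defaults.
import torch

RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]  # (n_fft, hop, win)

def stft_magnitude(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(y_fake, y_real):
    """y_fake, y_real: (batch, samples) waveforms."""
    total = 0.0
    for n_fft, hop, win in RESOLUTIONS:
        mag_f = stft_magnitude(y_fake, n_fft, hop, win)
        mag_r = stft_magnitude(y_real, n_fft, hop, win)
        sc = torch.norm(mag_r - mag_f, p="fro") / torch.norm(mag_r, p="fro")
        log_l1 = torch.mean(torch.abs(torch.log(mag_r) - torch.log(mag_f)))
        total = total + sc + log_l1
    return total / len(RESOLUTIONS)
```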
Adversarial Audio Synthesis
- Computer Science, ICLR
- 2019
WaveGAN is a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio; it is capable of synthesizing one-second slices of audio waveforms with global coherence, suitable for sound effect generation.
Adversarially Trained Multi-Singer Sequence-To-Sequence Singing Synthesizer
- Computer Science, INTERSPEECH
- 2020
Both objective and subjective evaluations indicate that the proposed synthesizer can generate higher quality singing voice than baseline, and the articulation of high-pitched vowels is significantly enhanced.
Crepe: A Convolutional Representation for Pitch Estimation
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.
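A short sketch of how a CREPE-style 360-bin frame activation can be decoded to a pitch in Hz. The 20-cent bin grid and the 10 Hz cents reference follow the reference implementation's convention, but the constants here are restated from memory and should be treated as approximate.

```python
# Decoding a CREPE-style 360-bin activation to Hz; constants are approximate.
import numpy as np

# 360 bins spaced 20 cents apart, starting near C1 (~32.7 Hz); cents are
# measured relative to 10 Hz, as in the reference implementation.
CENTS = np.linspace(0, 7180, 360) + 1997.3794084376191

def decode_pitch(activation: np.ndarray) -> float:
    """activation: (360,) frame-level network output -> estimated pitch in Hz."""
    center = int(np.argmax(activation))
    lo, hi = max(0, center - 4), min(len(CENTS), center + 5)
    # Local weighted average around the peak gives sub-bin (<20 cent) resolution.
    cents = float(np.sum(CENTS[lo:hi] * activation[lo:hi]) / np.sum(activation[lo:hi]))
    return 10.0 * 2.0 ** (cents / 1200.0)
```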
MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms
- Computer Science, ArXiv
- 2019
MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice, is proposed and applied to perform music style transfer.
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation
- Computer Science, Interspeech
- 2021
UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time, is proposed; it adds a multi-resolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets.
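A rough sketch of the multi-resolution spectrogram discriminator idea: the same waveform is converted to several linear-magnitude spectrograms with different STFT parameter sets, and each resolution is scored by its own 2-D convolutional sub-discriminator. The parameter sets and layer sizes below are assumptions.

```python
# Rough sketch of a multi-resolution spectrogram discriminator; STFT parameter
# sets and layer sizes are assumed.
import torch
import torch.nn as nn

PARAM_SETS = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]  # (n_fft, hop, win)

def linear_magnitude(wav, n_fft, hop, win):               # wav: (batch, samples)
    window = torch.hann_window(win, device=wav.device)
    spec = torch.stft(wav, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().unsqueeze(1)                        # (batch, 1, freq, frames)

class MultiResolutionSpectrogramDiscriminator(nn.Module):
    def __init__(self, param_sets=PARAM_SETS):
        super().__init__()
        self.param_sets = param_sets
        self.discs = nn.ModuleList([
            nn.Sequential(                                # toy per-resolution network
                nn.Conv2d(1, 32, 3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(32, 1, 3, padding=1),
            )
            for _ in param_sets
        ])

    def forward(self, wav):
        return [disc(linear_magnitude(wav, *p))
                for p, disc in zip(self.param_sets, self.discs)]
```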
XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System
- Computer Science, INTERSPEECH
- 2020
XiaoiceSing is a high-quality singing voice synthesis system that employs an integrated network for spectrum, F0, and duration modeling; it follows the main architecture of FastSpeech while adding singing-specific designs, and evaluations demonstrate its substantial advantages.