RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses

@inproceedings{Xu2021RefineGANUG,
  title={RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses},
  author={Shengyuan Xu and Wenxiao Zhao and Jing Guo},
  booktitle={Interspeech},
  year={2021}
}
Most GAN (Generative Adversarial Network)-based approaches to high-fidelity waveform generation rely heavily on discriminators to improve their performance. However, GAN methods introduce considerable uncertainty into the generation process and often result in mismatches of pitch and intensity, which is fatal in sensitive use cases such as singing voice synthesis (SVS). To address this problem, we propose RefineGAN, a high-fidelity neural vocoder focused on the robustness, pitch and…
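
The pitch- and intensity-response claims that motivate RefineGAN can be checked mechanically. Below is a minimal sketch, assuming librosa is installed and using hypothetical file names: it compares a vocoder's output against the ground truth using a pYIN pitch track and per-frame RMS energy.

```python
# Sketch: measuring pitch and intensity mismatch between a vocoder's
# output and ground truth. File names are hypothetical placeholders.
import librosa
import numpy as np

gen, sr = librosa.load("generated.wav", sr=None)
ref, _ = librosa.load("ground_truth.wav", sr=sr)

def f0_track(y, sr):
    # pYIN returns NaN for unvoiced frames; keep the voicing mask.
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=1000.0, sr=sr)
    return f0, voiced

f0_gen, v_gen = f0_track(gen, sr)
f0_ref, v_ref = f0_track(ref, sr)
n = min(len(f0_gen), len(f0_ref))
both = v_gen[:n] & v_ref[:n]

# Pitch response: RMSE in cents over frames voiced in both signals.
cents = 1200 * np.log2(f0_gen[:n][both] / f0_ref[:n][both])
print("F0 RMSE (cents):", np.sqrt(np.mean(cents ** 2)))

# Intensity response: mean gap between per-frame RMS energies in dB.
rms_gen = librosa.amplitude_to_db(librosa.feature.rms(y=gen)[0])
rms_ref = librosa.amplitude_to_db(librosa.feature.rms(y=ref)[0])
m = min(len(rms_gen), len(rms_ref))
print("RMS gap (dB):", np.mean(np.abs(rms_gen[:m] - rms_ref[:m])))
```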

Citations

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

Experimental results show that the proposed HiFi-WaveGAN outperforms other neural vocoders such as Parallel WaveGAN (PWG) and HiFi-GAN in the mean opinion score (MOS) metric on the 48 kHz SVS task.

DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

DSPGAN is a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP) to eliminate the mismatch problem caused by the ground-truth spectrograms used in the training phase.

Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher

Experiments show that the proposed Learn2Sing 2.0 can synthesize a high-quality singing voice for a target speaker who has no singing data, using only 10 decoding steps.

VISinger 2: High-Fidelity End-to-End Singing Voice Synthesis Enhanced by Digital Signal Processing Synthesizer

VISinger 2 incorporates a DSP synthesizer into the decoder to address these issues; experimental results show that it substantially outperforms CpopSing, VISinger, and RefineSinger in both subjective and objective metrics.

Improve GAN-based Neural Vocoder using Truncated Pointwise Relativistic Least Square GAN

  • Yanli Li, Congyi Wang
  • Computer Science
  • Proceedings of the 4th International Conference on Advanced Information Science and System
  • 2022
This paper proposes a simple yet effective variant of the LSGAN framework, named Truncated Pointwise Relativistic LSGAN (T-PRLSGAN). It considers the pointwise truism score distribution of real and fake wave segments, and combines the mean squared error (MSE) loss with the proposed truncated pointwise relative discrepancy loss to make it harder for the generator to fool the discriminator, leading to improved audio generation quality and stability.
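
The summary above gives only the shape of the loss, not its exact formulation. The PyTorch sketch below is one plausible reading, assuming pointwise (per-segment) discriminator scores; the truncation bound tau and the weight lam are invented parameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def t_prlsgan_generator_loss(d_real, d_fake, fake, real, tau=1.0, lam=1.0):
    # d_real, d_fake: pointwise discriminator scores per segment, (B, T').
    # Relativistic term: how much more "real" each real segment scores
    # than its fake counterpart, truncated so the generator gains nothing
    # from pushing past the real score distribution.
    disc = torch.clamp(d_real.detach() - d_fake, min=0.0, max=tau)
    adv = (disc ** 2).mean()
    # Combined with a plain MSE reconstruction term, as the paper describes.
    return adv + lam * F.mse_loss(fake, real)
```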

References

Showing 1–10 of 40 references.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown through mel-spectrogram inversion for unseen speakers and end-to-end speech synthesis.
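
The periodic-pattern modeling refers to HiFi-GAN's multi-period discriminator, which folds the 1-D waveform into a 2-D map so a 2-D CNN sees equally spaced samples of each candidate period. A minimal sketch of that reshaping step (the surrounding convolutional stack is omitted); the periods 2, 3, 5, 7, 11 are the ones used in the paper.

```python
import torch
import torch.nn.functional as F

def fold_by_period(wav: torch.Tensor, period: int) -> torch.Tensor:
    # wav: (batch, 1, samples). Pad so the length divides the period,
    # then view as (batch, 1, samples // period, period): each column
    # holds equally spaced samples, exposing period-p structure to a
    # 2-D convolution.
    b, c, t = wav.shape
    pad = (period - t % period) % period
    if pad:
        wav = F.pad(wav, (0, pad), mode="reflect")
    return wav.view(b, c, -1, period)

x = torch.randn(1, 1, 22050)
print([fold_by_period(x, p).shape for p in (2, 3, 5, 7, 11)])
```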

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive and fully convolutional, with significantly fewer parameters than competing models, and it generalizes to unseen speakers for mel-spectrogram inversion. The paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

This paper introduces multi-scale adversarial training in both the acoustic model and the vocoder to improve singing modeling, and proposes a novel sub-frequency GAN for mel-spectrogram generation, which splits the full 80-dimensional mel-frequency range into multiple sub-bands and models each sub-band with a separate discriminator.
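
The sub-band split itself amounts to slicing the mel frequency axis; the band boundaries in this sketch are illustrative assumptions, not the paper's values.

```python
import torch

def split_mel_subbands(mel, bands=((0, 40), (20, 60), (40, 80))):
    # mel: (batch, 80, frames). Overlapping slices along the frequency
    # axis; each slice would be scored by its own discriminator.
    return [mel[:, lo:hi, :] for lo, hi in bands]

mel = torch.randn(4, 80, 200)
print([b.shape for b in split_mel_subbands(mel)])
```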

Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real time on a single GPU, which is comparable to the best distillation-based Parallel WaveNet system.
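
The multi-resolution spectrogram term is Parallel WaveGAN's distinctive auxiliary loss: spectral-convergence and log-magnitude distances summed over several STFT configurations. A PyTorch sketch, with the parameter sets chosen for illustration rather than copied from the paper:

```python
import torch

def stft_mag(x, n_fft, hop, win):
    # x: (batch, samples). Linear-magnitude STFT, floored to avoid log(0).
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(fake, real,
                               resolutions=((1024, 256, 1024),
                                            (2048, 512, 2048),
                                            (512, 128, 512))):
    # Sum of spectral-convergence and log-magnitude L1 terms over
    # several STFT parameter sets (n_fft, hop, win_length).
    loss = 0.0
    for n_fft, hop, win in resolutions:
        mf = stft_mag(fake, n_fft, hop, win)
        mr = stft_mag(real, n_fft, hop, win)
        sc = torch.norm(mr - mf, p="fro") / torch.norm(mr, p="fro")
        mag = torch.nn.functional.l1_loss(mf.log(), mr.log())
        loss = loss + sc + mag
    return loss / len(resolutions)
```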

Adversarial Audio Synthesis

WaveGAN is a first attempt at applying GANs to the unsupervised synthesis of raw-waveform audio; it is capable of synthesizing one-second slices of audio waveforms with global coherence, suitable for sound effect generation.

Adversarially Trained Multi-Singer Sequence-To-Sequence Singing Synthesizer

Both objective and subjective evaluations indicate that the proposed synthesizer can generate higher-quality singing voices than the baseline, and that the articulation of high-pitched vowels is significantly enhanced.

Crepe: A Convolutional Representation for Pitch Estimation

This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.
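
CREPE ships as a pip package, so the pitch tracker is easy to try directly. A minimal usage sketch following the project's README, with a hypothetical input file:

```python
# pip install crepe
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read("vocal.wav")  # hypothetical input file
# Returns one F0 estimate every 10 ms plus a per-frame confidence;
# Viterbi smoothing decodes a continuous pitch track.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
```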

MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms

MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice, is proposed and applied to perform music style transfer.

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time, is proposed; it adds a multi-resolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets.
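
Each STFT parameter set feeds its own small 2-D CNN critic over a linear-magnitude spectrogram; the layer widths and kernel sizes below are illustrative assumptions, not UnivNet's exact configuration.

```python
import torch
import torch.nn as nn

class SpecDiscriminator(nn.Module):
    # One critic per STFT parameter set; a UnivNet-style setup sums or
    # averages the adversarial losses across all of them.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, (3, 9), padding=(1, 4)), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 32, (3, 9), stride=(1, 2), padding=(1, 4)),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, (3, 3), padding=(1, 1)),
        )

    def forward(self, mag):
        # mag: (batch, 1, freq_bins, frames) linear-magnitude spectrogram.
        return self.net(mag)

d = SpecDiscriminator()
print(d(torch.randn(2, 1, 513, 100)).shape)
```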

XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

XiaoiceSing is a high-quality singing voice synthesis system that employs an integrated network for spectrum, F0, and duration modeling. It follows the main architecture of FastSpeech while adding singing-specific designs, and evaluations demonstrate the clear advantages of XiaoiceSing.