Corpus ID: 239024498

Chunked Autoregressive GAN for Conditional Waveform Synthesis

@article{Morrison2021ChunkedAG,
  title={Chunked Autoregressive GAN for Conditional Waveform Synthesis},
  author={Max Morrison and Rithesh Kumar and Kundan Kumar and Prem Seetharaman and Aaron C. Courville and Yoshua Bengio},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.10139}
}
Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram… 
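
The mechanism named in the title lends itself to a short illustration. Below is a minimal sketch of chunked autoregressive sampling, assuming a generator g(mel_chunk, prev_samples) and 256 samples per mel frame; all names, shapes, and the interface are illustrative, not the paper's actual API.

    import numpy as np

    def chunked_autoregressive_synthesis(g, mels, chunk_frames, ar_samples):
        """Sketch of chunked autoregressive sampling: generate audio in
        fixed-size chunks, conditioning each chunk on the most recent samples
        produced so far (illustrative interface, not the paper's exact one)."""
        audio = np.zeros(ar_samples)                   # zero history bootstraps chunk 1
        for start in range(0, mels.shape[1], chunk_frames):
            mel_chunk = mels[:, start:start + chunk_frames]
            chunk = g(mel_chunk, audio[-ar_samples:])  # parallel GAN sampling per chunk
            audio = np.concatenate([audio, chunk])
        return audio[ar_samples:]                      # strip the synthetic history

    toy_g = lambda mel, prev: 0.01 * np.random.randn(mel.shape[1] * 256)  # stand-in net
    out = chunked_autoregressive_synthesis(toy_g, np.zeros((80, 64)), 8, 512)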

PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping

PhaseAug is presented, the first differentiable augmentation for speech synthesis; it rotates the phase of each frequency bin to simulate one-to-many mapping and outperforms baselines without any architecture modification.
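
Since the augmentation itself is compact, a rough illustration may help. The sketch below rotates each STFT bin's phase by a random offset and resynthesizes with scipy; it is a simplified, non-differentiable stand-in for the actual method, which is applied differentiably inside the training loop.

    import numpy as np
    from scipy.signal import stft, istft

    def phase_rotate(x, fs=22050, nperseg=1024):
        """Rotate the phase of each STFT frequency bin by a random offset
        (simplified, non-differentiable sketch of the PhaseAug idea)."""
        f, t, Z = stft(x, fs=fs, nperseg=nperseg)
        phi = np.random.uniform(-np.pi, np.pi, size=(Z.shape[0], 1))  # one offset per bin
        _, x_aug = istft(Z * np.exp(1j * phi), fs=fs, nperseg=nperseg)  # same |Z|, new phase
        return x_aug

    x_aug = phase_rotate(np.random.randn(22050))   # 1 s of noise as a stand-in signal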

Avocodo: Generative Adversarial Network for Artifact-free Vocoder

This paper proposes a GAN-based vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts, and introduces two kinds of discriminators to evaluate speech waveforms from various perspectives: a collaborative multi-band discriminator and a sub-band discriminator.

A Post Auto-regressive GAN Vocoder Focused on Spectrum Fracture

A post auto-regressive (AR) GAN vocoder with a self-attention layer that does not participate in inference but helps the generator learn temporal dependencies within frames during training.

Mel Spectrogram Inversion with Stable Pitch

This work proposes a new vocoder model that is specifically designed for music, and results in 60% and 10% improved reconstruction of sustained notes and chords with respect to existing models, using a novel harmonic error metric.

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

This work presents BigVGAN, a universal vocoder that generalizes well to various unseen conditions in a zero-shot setting, and introduces periodic nonlinearities and an anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis and significantly improves audio quality.
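
The periodic nonlinearity in BigVGAN is the Snake activation, snake(x) = x + (1/α)·sin²(αx), with α learned per channel; a minimal numpy sketch:

    import numpy as np

    def snake(x, alpha=1.0):
        """Snake activation: x + (1/alpha) * sin^2(alpha * x). As alpha -> 0
        it approaches the identity; larger alpha injects a stronger periodic
        component into the representation."""
        return x + np.sin(alpha * x) ** 2 / alpha

    print(snake(np.linspace(-4, 4, 9), alpha=1.0))

In the paper, the anti-aliased representation is obtained by applying low-pass-filtered up- and downsampling around this periodic activation.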

Multi-instrument Music Synthesis with Spectrogram Diffusion

This work compares training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and finds that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics.

R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

Qualitative and quantitative evaluations of an R-MelNet system trained on a single-speaker TTS dataset demonstrate the effectiveness of the approach, which includes an approximate, numerically stable mixture-of-logistics attention.

Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis

The proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch, and is jointly optimized in an end-to-end manner with variational inference and adversarial objectives.
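
A sample-level sinusoidal source of this kind can be sketched by upsampling a frame-level F0 contour and accumulating phase, as in neural source-filter models; the voiced/unvoiced handling and noise level below are simplifying assumptions.

    import numpy as np

    def sine_source(f0_frames, hop=256, sr=22050, noise_std=0.003):
        """Build a sample-level sinusoidal excitation from frame-level F0.
        Voiced frames (f0 > 0) contribute a sine whose phase is the cumulative
        sum of instantaneous frequency; unvoiced frames fall back to noise."""
        f0 = np.repeat(f0_frames, hop)             # frame rate -> sample rate
        phase = 2 * np.pi * np.cumsum(f0 / sr)     # phase accumulation
        source = np.sin(phase)
        source[f0 == 0] = 0.0                      # no harmonic part when unvoiced
        return source + noise_std * np.random.randn(len(source))

    src = sine_source(np.array([0.0, 220.0, 220.0, 230.0, 0.0]))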

Iterative autoregression: a novel trick to improve your low-latency speech enhancement model

This paper presents a simple yet effective trick for training autoregressive low-latency speech enhancement models and demonstrates that the proposed technique yields stable improvements across different architectures and training scenarios.
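
The trick is in the spirit of scheduled sampling: over training, the teacher-forced autoregressive context is progressively replaced with the model's own past outputs so that training inputs match inference. The sketch below is a hypothetical rendering of that idea; the mixing probability p_own and the model interface are assumptions, not the paper's exact procedure.

    import numpy as np

    def train_step(model, noisy, clean, frame, p_own):
        """One frame-by-frame pass. With probability p_own the autoregressive
        context is the model's own previous output instead of the ground-truth
        past; p_own is annealed from 0 toward 1 over training (assumed)."""
        context, loss, outs = np.zeros(frame), 0.0, []
        for i in range(0, len(noisy) - frame + 1, frame):
            y = model(noisy[i:i + frame], context)
            loss += np.mean((y - clean[i:i + frame]) ** 2)
            outs.append(y)
            context = y if np.random.rand() < p_own else clean[i:i + frame]
        return loss, np.concatenate(outs)

    toy_model = lambda frm, ctx: frm - 0.1 * ctx   # stand-in enhancer
    loss, y = train_step(toy_model, np.random.randn(1024), np.random.randn(1024), 128, 0.5)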

Improve GAN-based Neural Vocoder using Truncated Pointwise Relativistic Least Square GAN

  • Yanli Li, Congyi Wang
  • Computer Science
  • Proceedings of the 4th International Conference on Advanced Information Science and System
  • 2022
This paper proposes a simple yet effective variant of the LSGAN framework, named Truncated Pointwise Relativistic LSGAN (T-PRLSGAN). It considers the pointwise truism score distribution of real and fake wave segments and combines the Mean Squared Error (MSE) loss with the proposed truncated pointwise relative discrepancy loss to make it harder for the generator to fool the discriminator, leading to improved audio generation quality and stability.
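
Since the exact formulation is not reproduced here, the following is only one plausible reading, with the threshold tau and the pairing of real/fake scores as assumptions: truncate the pointwise gap between real and fake discriminator scores before applying the least-squares penalty.

    import numpy as np

    def truncated_relativistic_ls(d_real, d_fake, tau=2.0):
        """Hypothetical sketch: least-squares loss on the pointwise gap between
        real and fake discriminator scores, truncated at tau so that already
        well-separated pairs stop pushing the discriminator further."""
        gap = np.minimum(d_real - d_fake, tau)     # pointwise relative discrepancy
        return np.mean((gap - 1.0) ** 2)

    print(truncated_relativistic_ls(np.random.randn(100), np.random.randn(100)))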

References

Showing 1-10 of 49 references

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown on mel-spectrogram inversion for unseen speakers and on end-to-end speech synthesis.
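
The periodic modeling referred to here is HiFi-GAN's multi-period discriminator, which folds the 1-D waveform into a 2-D grid of width p and applies a 2-D convolutional discriminator for each period p in {2, 3, 5, 7, 11}; a sketch of the fold:

    import numpy as np

    def fold_by_period(x, p):
        """Reshape a 1-D waveform into a (T/p, p) grid so a 2-D convolutional
        discriminator sees every p-th sample in a column (MPD-style input)."""
        pad = (-len(x)) % p                        # right-pad to a multiple of p
        return np.pad(x, (0, pad)).reshape(-1, p)

    grid = fold_by_period(np.arange(10, dtype=float), p=3)   # shape (4, 3)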

High Fidelity Speech Synthesis with Adversarial Networks

GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator.

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

This work presents a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models, adopting variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling.

Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis

This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and the stochastic gradient descent method; the quality of its synthetic speech is close to that of speech generated by the AR WaveNet.

WaveFlow: A Compact Flow-based Model for Raw Audio

WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps.
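
The few sequential steps come from squeezing the 1-D waveform into a 2-D array of height h and running the autoregression only over the h rows; a sketch of the squeeze, with h = 16 used here purely for illustration:

    import numpy as np

    def squeeze(x, h=16):
        """Fold a length-T waveform into an (h, T/h) matrix; the flow is
        autoregressive only over the h rows, so sampling needs h sequential
        steps regardless of T."""
        T = len(x) - len(x) % h                    # drop the ragged tail
        return x[:T].reshape(-1, h).T              # column j holds samples j*h .. j*h+h-1

    X = squeeze(np.arange(32, dtype=float), h=16)  # shape (16, 2)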

WaveGrad: Estimating Gradients for Waveform Generation

WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality.
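
That speed/quality trade-off is simply the length of the refinement loop. The schematic below uses a placeholder denoiser and a crude noise schedule; the paper's actual sampler and schedule differ.

    import numpy as np

    def refine(denoise, mel, n_steps, length):
        """Iterative refinement: start from noise and repeatedly apply the
        learned denoiser; fewer steps mean faster, lower-fidelity synthesis."""
        y = np.random.randn(length)
        for step in reversed(range(n_steps)):
            noise_level = (step + 1) / n_steps     # crude stand-in for the schedule
            y = denoise(y, mel, noise_level)
        return y

    toy_denoise = lambda y, mel, s: 0.9 * y        # placeholder network
    audio = refine(toy_denoise, mel=None, n_steps=6, length=1024)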

Crepe: A Convolutional Representation for Pitch Estimation

This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.
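
CREPE treats pitch as classification over cent-spaced frequency bins, and a frequency is decoded as a local weighted average of bin centres around the argmax. The sketch below assumes a 20-cent grid of 360 bins starting near C1 (~32.7 Hz); the released model's exact bin offsets differ slightly.

    import numpy as np

    def decode_pitch(probs, f_min=32.70, bins_per_octave=60):
        """Convert a 360-bin pitch posterior to Hz: weighted average of bin
        centres (in cents) in a window around the argmax, then cents -> Hz."""
        cents = 1200 * np.log2(f_min / 10) + np.arange(len(probs)) * (1200 / bins_per_octave)
        k = int(np.argmax(probs))
        lo, hi = max(0, k - 4), min(len(probs), k + 5)
        c = np.sum(cents[lo:hi] * probs[lo:hi]) / np.sum(probs[lo:hi])
        return 10 * 2 ** (c / 1200)

    probs = np.exp(-0.5 * ((np.arange(360) - 120) / 2.0) ** 2)  # toy posterior
    print(decode_pitch(probs))                                  # ~130.8 Hz (C3)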

Efficient Neural Audio Synthesis

WaveRNN, a single-layer recurrent neural network with a dual softmax layer, matches the quality of the state-of-the-art WaveNet model; a new generation scheme based on subscaling folds a long sequence into a batch of shorter sequences and allows multiple samples to be generated at once.
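
Subscaling can be illustrated directly: fold a length-T sequence into B interleaved sub-sequences (row b holds samples b, B+b, 2B+b, ...), which can then be generated largely as a batch under a bounded lookahead; B = 8 below is arbitrary. The fold and its inverse:

    import numpy as np

    def subscale_fold(x, B):
        """Fold samples 0..T-1 into B sub-sequences: row b holds every B-th
        sample starting at offset b, so the batch dimension replaces most of
        the sequential dependency."""
        T = len(x) - len(x) % B
        return x[:T].reshape(-1, B).T              # shape (B, T/B)

    def subscale_unfold(rows):
        return rows.T.reshape(-1)                  # interleave back to 1-D

    x = np.arange(32, dtype=float)
    rows = subscale_fold(x, B=8)                   # rows[b] = [b, b+8, b+16, b+24]
    assert np.allclose(subscale_unfold(rows), x)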

End-to-End Adversarial Text-to-Speech

This work takes on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs.