• Corpus ID: 210473083

DDSP: Differentiable Digital Signal Processing

  • Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts
  • ICLR
  • 2020
Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto… 
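The DDSP approach above turns classic synthesizer components into differentiable building blocks. A minimal sketch of its simplest element, a harmonic (additive) oscillator, is shown below in NumPy; the function name and constant controls are illustrative assumptions, not the DDSP API, which drives f0 and the amplitudes with time-varying network outputs.

```python
import numpy as np

def harmonic_synth(f0_hz, amplitudes, sample_rate=16000, n_samples=16000):
    """Sum sinusoids at integer multiples of f0, weighted per harmonic.

    In a DDSP-style model, f0_hz and amplitudes would be time-varying
    network outputs; they are held constant here for clarity.
    """
    t = np.arange(n_samples) / sample_rate
    harmonics = np.arange(1, len(amplitudes) + 1)  # 1, 2, 3, ...
    # Each row of `bank` is one harmonic's sinusoid; weight and sum them.
    bank = np.sin(2 * np.pi * np.outer(harmonics, t) * f0_hz)
    return amplitudes @ bank

# A 220 Hz tone with three decaying harmonics.
audio = harmonic_synth(220.0, np.array([1.0, 0.5, 0.25]))
```

Because every operation is a differentiable NumPy/tensor op, gradients can flow from an audio loss back into the controls, which is the core of the DDSP idea.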
RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
This paper introduces a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis and introduces a novel two-stage training procedure, namely representation learning and adversarial fine-tuning.
A Spectral Energy Distance for Parallel Speech Synthesis
This work proposes a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function, based on a generalized energy distance between the distributions of the generated and real audio.
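The generalized energy distance mentioned above can be estimated from samples alone, which is what removes the need for an analytical likelihood. A minimal Monte-Carlo sketch, assuming plain Euclidean distance in place of the paper's multi-scale spectrogram distance:

```python
import numpy as np

def energy_distance(gen, real, d=lambda a, b: np.linalg.norm(a - b)):
    """Estimate the generalized energy distance
    2*E[d(x, y)] - E[d(x, x')] - E[d(y, y')]
    between generated samples `gen` and real samples `real` (rows).

    The paper computes d over spectrogram-based features; Euclidean
    distance is used here purely for illustration.
    """
    cross = np.mean([d(x, y) for x in gen for y in real])
    within_g = np.mean([d(x, y) for x in gen for y in gen])
    within_r = np.mean([d(x, y) for x in real for y in real])
    return 2 * cross - within_g - within_r
```

Identical sample sets give a distance of zero, and mismatched distributions give a positive value, so the estimate can serve directly as a training loss.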
HooliGAN: Robust, High Quality Neural Vocoding
This work introduces HooliGAN, a robust vocoder that achieves state-of-the-art results, fine-tunes very well to smaller datasets (<30 minutes of speech data), and generates audio at 2.2 MHz on GPU and 35 kHz on CPU.
Multi-instrument Music Synthesis with Spectrogram Diffusion
This work compares training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and finds that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics.
DrumGAN: Synthesis of Drum Sounds With Timbral Feature Conditioning Using Generative Adversarial Networks
A Generative Adversarial Network is applied to the task of audio synthesis of drum sounds and it is shown that the approach considerably improves the quality of the generated drum samples, and that the conditional input indeed shapes the perceptual characteristics of the sounds.
Audio representations for deep learning in sound synthesis: A review
  • A. Natsiou, Sean O'Leary
  • Computer Science
    2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA)
  • 2021
An overview of audio representations applied to sound synthesis using deep learning is presented, together with the most significant methods for developing and evaluating a sound-synthesis architecture with deep-learning models, each depending on the chosen audio representation.
Real-time Timbre Transfer and Sound Synthesis using DDSP
A real-time implementation of the DDSP library, embedded as a virtual-synthesizer plug-in that can be used in a Digital Audio Workstation, focuses on timbre transfer from learned representations of real instruments to arbitrary sound inputs, as well as on controlling these models via MIDI.
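Driving such a plug-in from MIDI requires mapping note numbers to the fundamental frequency a DDSP model expects. The standard equal-temperament conversion (A4 = MIDI note 69 = 440 Hz) is:

```python
def midi_to_hz(note, a4_hz=440.0):
    """Equal-temperament MIDI note number -> frequency in Hz (A4 = note 69)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)
```

For example, note 60 (middle C) maps to roughly 261.63 Hz, and each octave (12 notes) doubles the frequency.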
Streamable Neural Audio Synthesis With Non-Causal Convolutions
This paper introduces a new method for producing non-causal streaming models, which makes any convolutional model compatible with real-time buffer-based processing and can transform models trained without causal constraints into streaming models.
Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations
A neural analysis and synthesis framework that can manipulate the voice, pitch, and speed of an arbitrary speech signal, built on a novel training strategy based on information perturbation that allows fully self-supervised training.
Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization
Evaluation results show that the new model outperforms the prior one both objectively and subjectively; it is employed to unconditionally generate sequences of piano and violin music, with promising results.


Adversarial Audio Synthesis
WaveGAN is a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio, capable of synthesizing one-second slices of audio waveforms with global coherence, suitable for sound-effect generation.
GANSynth: Adversarial Neural Audio Synthesis
Through extensive empirical investigations on the NSynth dataset, it is demonstrated that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
SING: Symbol-to-Instrument Neural Generator
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
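A loss between log spectrograms of generated and target waveforms, as described above, can be sketched as follows; the window size, hop length, and L1 distance here are illustrative choices, not necessarily SING's exact configuration:

```python
import numpy as np

def log_spectrogram(x, n_fft=256, hop=64, eps=1e-7):
    """Log-magnitude STFT with a Hann window (minimal framing sketch)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    return np.log(mag + eps)  # eps avoids log(0) on silent bins

def spectral_loss(generated, target):
    """Mean L1 distance between log spectrograms of two waveforms."""
    return np.mean(np.abs(log_spectrogram(generated) - log_spectrogram(target)))
```

Comparing log magnitudes rather than raw samples makes the loss insensitive to phase, which is a key reason such losses train stably end-to-end.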
Universal audio synthesizer control with normalizing flows
A novel formulation of audio synthesizer control is introduced that can address simultaneously automatic parameter inference, macro-control learning and audio-based preset exploration within a single model and is able to learn semantic controls of a synthesizer by smoothly mapping to its parameters.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets is introduced.
WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN
A deep-neural-network-based singing voice synthesizer, inspired by the Deep Convolutional Generative Adversarial Networks (DCGAN) architecture and optimized using the Wasserstein-GAN algorithm, which facilitates the modelling of the large variability of pitch in the singing voice.
Crepe: A Convolutional Representation for Pitch Estimation
This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.
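Decoding CREPE-style output back to a frequency means mapping pitch bins (spaced in cents) to Hz. The constants below (360 bins, 20-cent spacing, offset relative to a 10 Hz reference) are assumptions drawn from the open-source CREPE implementation and should be treated as illustrative:

```python
def bin_to_hz(bin_index, cents_start=1997.3794, cents_step=20.0):
    """Map a pitch-bin index to Hz, assuming CREPE-style bins: 20-cent
    steps measured in cents relative to a 10 Hz reference. The constants
    are taken from the open-source implementation, not guaranteed here."""
    cents = cents_start + cents_step * bin_index
    return 10.0 * 2.0 ** (cents / 1200.0)
```

With 20-cent bins, 60 bins span exactly one octave (1200 cents), so the frequency doubles every 60 bins regardless of the offset constant.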
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
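WaveNet's long temporal context comes from stacked dilated causal convolutions whose dilation doubles layer by layer. The receptive field of such a stack can be computed directly; the block and layer counts below are illustrative, not WaveNet's exact configuration:

```python
def receptive_field(n_blocks=3, layers_per_block=10, kernel_size=2):
    """Receptive field (in samples) of stacked dilated causal convolutions
    with dilations 1, 2, 4, ..., repeated n_blocks times, the pattern
    used by WaveNet-style models."""
    dilations = [2 ** i for i in range(layers_per_block)] * n_blocks
    # Each layer extends the context by (kernel_size - 1) * dilation samples.
    return sum((kernel_size - 1) * d for d in dilations) + 1
```

One block of 10 layers with kernel size 2 already covers 1024 samples, which is how a few dozen layers reach contexts of thousands of samples.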
Efficient Neural Audio Synthesis
The WaveRNN, a single-layer recurrent neural network with a dual softmax layer, matches the quality of the state-of-the-art WaveNet model; a new generation scheme based on subscaling folds a long sequence into a batch of shorter sequences, allowing multiple samples to be generated at once.
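The subscaling idea above starts from a simple folding step: interleave a long sequence into a batch of shorter sub-sequences. A minimal sketch of that fold and its inverse (the paper's staggered conditioning between sub-sequences is not modeled here):

```python
import numpy as np

def subscale_fold(x, batch_factor):
    """Fold a 1-D sequence into `batch_factor` interleaved sub-sequences:
    sub-sequence b holds samples b, b+B, b+2B, ... (B = batch_factor)."""
    assert len(x) % batch_factor == 0
    return x.reshape(-1, batch_factor).T

def subscale_unfold(folded):
    """Inverse: interleave the sub-sequences back into one sequence."""
    return folded.T.reshape(-1)
```

After folding, the B sub-sequences can be processed as a batch, trading sequence length for batch size so that several samples are produced per RNN step.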
Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis
It was demonstrated that the NSF models generated waveforms at least 100 times faster than the authors' WaveNet-vocoder, and the quality of the synthetic speech from the best NSF model was comparable to that from WaveNet on a large single-speaker Japanese speech corpus.