Corpus ID: 232417882

Symbolic Music Generation with Diffusion Models

@article{Mittal2021SymbolicMG,
  title={Symbolic Music Generation with Diffusion Models},
  author={Gautam Mittal and Jesse Engel and Curtis Hawthorne and Ian Simon},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.16091}
}
Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our… 
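The core idea in the abstract — lifting discrete sequences into the continuous latent space of a pre-trained VAE so that a diffusion model can operate on them — can be illustrated with a minimal toy sketch. Everything below (`ToyVAE`, the nearest-codebook decoder, the linear beta schedule) is an illustrative assumption, not the paper's actual architecture or code:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyVAE:
    """Stand-in for a pre-trained VAE over discrete token sequences."""
    def __init__(self, vocab=16, dim=4):
        self.codebook = rng.normal(size=(vocab, dim))  # one latent per token

    def encode(self, tokens):
        return self.codebook[tokens]  # (seq_len, dim) continuous latents

    def decode(self, z):
        # nearest-codebook-entry decoding back to discrete tokens
        d = ((z[:, None, :] - self.codebook[None]) ** 2).sum(-1)
        return d.argmin(axis=1)

def forward_noise(z0, t, betas):
    """q(z_t | z_0): corrupt clean latents per a linear beta schedule."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    eps = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps, eps

vae = ToyVAE()
tokens = rng.integers(0, 16, size=8)            # a discrete "music" sequence
z0 = vae.encode(tokens)                         # lift into continuous space
betas = np.linspace(1e-4, 0.02, 50)
zt, eps = forward_noise(z0, t=49, betas=betas)  # diffusion acts on latents
recovered = vae.decode(z0)                      # clean latents round-trip
assert (recovered == tokens).all()
```

In the actual method a denoising network would be trained to invert `forward_noise` in latent space, and its samples would then be passed through the VAE decoder to produce discrete note sequences; this sketch only shows the encode → diffuse → decode plumbing that makes discrete data amenable to Langevin-style sampling.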
Diffusion bridges vector quantized Variational AutoEncoders
TLDR
A new model trains the prior and the encoder/decoder networks simultaneously; it is competitive with the autoregressive prior on the mini-ImageNet dataset and is very efficient in both optimization and sampling.
Score-based Generative Modeling in Latent Space
TLDR
The Latent Score-based Generative Model (LSGM) is proposed, a novel approach that trains SGMs in a latent space, relying on the variational autoencoder framework, and achieves state-of-the-art likelihood on the binarized OMNIGLOT dataset.
D2C: Diffusion-Denoising Models for Few-shot Conditional Generation
TLDR
Diffusion-Decoding models with Contrastive representations (D2C) are described: a paradigm for training unconditional variational autoencoders (VAEs) for few-shot conditional image generation, using contrastive self-supervised learning to improve representation quality.
Structured Denoising Diffusion Models in Discrete State-Spaces
TLDR
D3PMs are diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. by going beyond corruption processes with uniform transition probabilities, and it is shown that the choice of transition matrix is an important design decision that leads to improved results in image and text domains.
Gotta Go Fast When Generating Data with Score-Based Models
TLDR
This work carefully devises, piece by piece, an SDE solver with adaptive step sizes tailored to score-based generative models, which generates data 2 to 10 times faster than Euler–Maruyama (EM) while achieving better or equal sample quality.
Itô-Taylor Sampling Scheme for Denoising Diffusion Probabilistic Models using Ideal Derivatives
TLDR
A new DDPM sampler based on a second-order numerical scheme for stochastic differential equations (SDEs), whereas the conventional sampler is based on a first-order numerical scheme, using what the authors call "ideal derivative substitution".
Music Composition with Deep Learning: A Review
TLDR
This paper analyzes the ability of current Deep Learning models to generate music with creativity and the similarity between AI and human composition processes, among other topics, to answer some of the most relevant open questions.
Dreamsound: Deep Activation Layer Sonification
TLDR
This paper presents DreamSound, a creative adaptation of Deep Dream to sound, addressed from two approaches: input manipulation and sonification design; the chosen model is YAMNet, a pre-trained deep network for sound classification.
DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs
TLDR
DiffGAN-TTS is a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity and efficient speech synthesis and an active shallow diffusion mechanism is presented to further speed up inference.
Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations
TLDR
This work proposes a new graph diffusion process that models the joint distribution of the nodes and edges through a system of stochastic differential equations (SDEs) and demonstrates the effectiveness of the system of SDEs in modeling the node-edge relationships.

References

SHOWING 1-10 OF 71 REFERENCES
A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music
TLDR
This work proposes the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently, thereby avoiding the "posterior collapse" problem, which remains an issue for recurrent VAEs.
Learning a Latent Space of Multitrack Measures
TLDR
The recent MusicVAE model is extended to represent multitrack polyphonic measures as vectors in a latent space, which enables several useful operations such as generating plausible measures from scratch, interpolating between measures in a musically meaningful way, and manipulating specific musical attributes.
Counterpoint by Convolution
TLDR
This model is an instance of orderless NADE, which allows more direct ancestral sampling, and finds that Gibbs sampling greatly improves sample quality, which is demonstrated to be due to some conditional distributions being poorly modeled.
Jukebox: A Generative Model for Music
TLDR
It is shown that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes, and can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.
GANSynth: Adversarial Neural Audio Synthesis
TLDR
Through extensive empirical investigations on the NSynth dataset, it is demonstrated that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
PIANOTREE VAE: Structured Representation Learning for Polyphonic Music
TLDR
The experiments prove the validity of the PianoTree VAE via semantically meaningful latent codes for polyphonic segments, more satisfying reconstruction alongside the decent geometry learned in the latent space, and the model's benefits to the variety of downstream music generation.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
TLDR
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale, high-quality dataset of musical notes an order of magnitude larger than comparable public datasets, is introduced.
Generating Sentences from a Continuous Space
TLDR
This work introduces and studies an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences, allowing it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features.
Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models
TLDR
This paper develops a method to condition generation without retraining the model, combining attribute constraints with a universal "realism" constraint, which enforces similarity to the data distribution, and generates realistic conditional images from an unconditional variational autoencoder.
Imposing higher-level Structure in Polyphonic Music Generation using Convolutional Restricted Boltzmann Machines and Constraints
TLDR
A Convolutional Restricted Boltzmann Machine as a generative model is combined with gradient descent constraint optimisation to provide further control over the generation process, and it is possible to control the higher-level self-similarity structure, the meter, and the tonal properties of the resulting musical piece, while preserving its local musical coherence.