Neural Percussive Synthesis Parameterised by High-Level Timbral Features

@article{Ramires2020NeuralPS,
  title={Neural Percussive Synthesis Parameterised by High-Level Timbral Features},
  author={Ant{\'o}nio Ramires and Pritish Chandna and Xavier Favory and Emilia G{\'o}mez and Xavier Serra},
  journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={786-790}
}
  • Published 25 November 2019
We present a deep neural network-based methodology for synthesising percussive sounds with control over high-level timbral characteristics of the sounds. This approach allows for intuitive control of a synthesizer, enabling the user to shape sounds without extensive knowledge of signal processing. We use a feedforward convolutional neural network-based architecture, which is able to map input parameters to the corresponding waveform. We propose two datasets to evaluate our approach on both a… 
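The abstract above describes a feedforward convolutional architecture that maps a small vector of high-level timbral parameters directly to a waveform. As a rough illustration only (not the paper's actual model), a toy upsampling decoder with untrained random weights can be sketched in NumPy; the parameter names and layer sizes here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def transposed_conv1d(x, kernel, stride):
    """Naive single-channel 1-D transposed convolution (upsampling)."""
    out = np.zeros(len(x) * stride + len(kernel) - stride)
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

def decode(params, n_layers=3, stride=4):
    """Map a short parameter vector to a longer waveform by stacking
    transposed convolutions with tanh nonlinearities (weights untrained)."""
    x = params
    for _ in range(n_layers):
        kernel = rng.standard_normal(stride * 2) * 0.5
        x = np.tanh(transposed_conv1d(x, kernel, stride))
    return x

# Hypothetical timbral descriptors, e.g. boominess, brightness, depth, hardness.
timbre_params = np.array([0.8, 0.1, 0.5, 0.9])
wave = decode(timbre_params)  # 4 parameters expand to a 340-sample signal
```

In a trained version of such a model, the kernels would be learned so that moving one input parameter changes the corresponding perceptual quality of the output sound.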

Figures and Tables from this paper

Citations

DrumGAN: Synthesis of Drum Sounds With Timbral Feature Conditioning Using Generative Adversarial Networks
TLDR
A Generative Adversarial Network is applied to the task of audio synthesis of drum sounds and it is shown that the approach considerably improves the quality of the generated drum samples, and that the conditional input indeed shapes the perceptual characteristics of the sounds.
Differentiable Digital Signal Processing Mixture Model for Synthesis Parameter Extraction from Mixture of Harmonic Sounds
TLDR
This paper proposes a model (DDSP mixture model) that represents a mixture as the sum of the outputs of multiple pretrained DDSP autoencoders, and shows that the proposed method achieves high and stable performance compared with a straightforward method that applies the DDSP autoencoder to signals separated by an audio source separation method.
StyleWaveGAN: Style-based synthesis of drum sounds with extensive controls using generative adversarial networks
TLDR
By conditioning StyleWaveGAN on both the type of drum and several audio descriptors, it is able to synthesize waveforms faster than real-time on a GPU directly in CD quality up to a duration of 1.5s while retaining a considerable amount of control over the generation.
Adversarial Synthesis of Drum Sounds
TLDR
A strategy is proposed for the synthesis of drum sounds using generative adversarial networks (GANs) based on a conditional Wasserstein GAN, which learns the underlying probability distribution of a dataset compiled of labeled drum sounds.
Style-Based Drum Synthesis with GAN Inversion
TLDR
An overview of an unsupervised approach to deriving useful feature controls learned by a generative model is provided and a system for generation and transformation of drum samples using a style-based generative adversarial network (GAN) is proposed.
Neural Synthesis of Footsteps Sound Effects with Generative Adversarial Networks
TLDR
This paper implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods, showing encouraging results for the task at hand.
Loopnet: Musical Loop Synthesis Conditioned on Intuitive Musical Parameters
TLDR
This work presents LoopNet, a feed-forward generative model for creating loops conditioned on intuitive parameters and proposes intuitive controls for composers to map the ideas in their minds to an audio loop.
Make Your Own Audience: Virtual Listeners Can Filter Generated Drum Programs
TLDR
This work generates quick, scalable percussion synthesizers using classical signal processing, and uses features from Fourier transforms and autoencoder embeddings to train machine learning classifiers that find and classify synthesizer programs mimicking percussive sounds.
BassNet: A Variational Gated Autoencoder for Conditional Generation of Bass Guitar Tracks with Learned Interactive Control
TLDR
BassNet, a deep learning model for generating bass guitar tracks based on musical source material is presented, which is trained to learn a temporally stable two-dimensional latent space variable that offers interactive user control.
One Billion Audio Sounds from GPU-Enabled Modular Synthesis
TLDR
A multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, which is 100x larger than any audio dataset in the literature, and proposes novel approaches to synthesizer hyperparameter optimization.

References

Showing 1-10 of 22 references
WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN
TLDR
A deep neural network-based singing voice synthesizer, inspired by the Deep Convolutional Generative Adversarial Networks (DCGAN) architecture and optimized using the Wasserstein-GAN algorithm, which facilitates the modelling of the large variability of pitch in the singing voice.
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
TLDR
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets is introduced.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Spectrogram Feature Losses for Music Source Separation
TLDR
It is demonstrated that adding a high-level feature loss term, extracted from the spectrograms using a VGG net, can improve separation quality vis-a-vis a pure pixel-level loss in deep learning-based music source separation.
Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders
TLDR
The proposed model generates notes as magnitude spectrograms from any probabilistic latent code samples, with expressive control of orchestral timbres and playing styles, and can be applied to other sound domains, including a user's libraries with custom sound tags that could be mapped to specific generative controls.
Bridging Audio Analysis, Perception and Synthesis with Perceptually-regularized Variational Timbre Spaces
TLDR
It is shown that Variational Auto-Encoders (VAE) can bridge the lines of research and alleviate their weaknesses by regularizing the latent spaces to match perceptual distances collected from timbre studies by proposing three types of regularization and showing that these spaces can be used for efficient audio classification.
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
TLDR
The Wave-U-Net is proposed, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales; results indicate that its architecture yields performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data.
GANSynth: Adversarial Neural Audio Synthesis
TLDR
Through extensive empirical investigations on the NSynth dataset, it is demonstrated that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
Modulated Variational auto-Encoders for many-to-many musical timbre transfer
TLDR
This paper introduces the Modulated Variational auto-Encoders (MoVE) to perform musical timbre transfer, and shows that this architecture allows for generative controls in multi-domain transfer, yet remaining light, fast to train and effective on small datasets.