One Billion Audio Sounds from GPU-Enabled Modular Synthesis

@article{Turian2021OneBA,
  title={One Billion Audio Sounds from GPU-Enabled Modular Synthesis},
  author={Joseph P. Turian and Jordie Shier and George Tzanetakis and Kirk McNally and Max Henry},
  journal={2021 24th International Conference on Digital Audio Effects (DAFx)},
  year={2021},
  pages={222-229}
}
We release synth1B1, a multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, paired with the synthesis parameters used to generate them. The dataset is 100x larger than any audio dataset in the literature. We also introduce torchsynth, an open-source modular synthesizer that generates the synth1B1 samples on-the-fly at 16200x faster than real-time (714MHz) on a single GPU. Finally, we release two new audio datasets: FM synth timbre and subtractive synth pitch. Using…
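A minimal sketch of how these batches might be rendered on the fly with torchsynth (a hedged example assuming the torchsynth 1.x Voice API; the exact return signature and default batch size may differ between releases):

# Hedged sketch, assuming the torchsynth 1.x API; not guaranteed to match every release.
import torch
from torchsynth.synth import Voice

voice = Voice()                      # default config renders a batch of 4-second sounds per call
if torch.cuda.is_available():
    voice = voice.to("cuda")         # GPU rendering is where the 16200x speedup comes from

# Each integer batch index deterministically reproduces the corresponding synth1B1 batch.
# The tuple contents below (audio, parameters, train/test flag) are assumed from the docs.
audio, params, is_train = voice(0)
print(audio.shape)                   # e.g. (128, 176400) samples at 44.1 kHz with default settings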

Citations

Learning Audio Representations with MLPs

In this paper, we propose an efficient MLP-based approach for learning audio representations, namely timestamp and scene-level audio embeddings. We use an encoder consisting of sequentially stacked…
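As a rough, hedged sketch of the general idea (not the paper's exact architecture; layer sizes and normalization are placeholders), per-frame spectrogram features can be passed through stacked MLP blocks to give timestamp embeddings, with pooling producing a scene-level embedding:

import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    # One stacked block: a linear layer with normalization and a nonlinearity.
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_out), nn.LayerNorm(dim_out), nn.GELU())
    def forward(self, x):
        return self.net(x)

encoder = nn.Sequential(MLPBlock(64, 256), MLPBlock(256, 256), MLPBlock(256, 256))

mel = torch.randn(8, 500, 64)            # (batch, time frames, mel bins), placeholder input
timestamp_emb = encoder(mel)             # (batch, time frames, 256): one embedding per frame
scene_emb = timestamp_emb.mean(dim=1)    # (batch, 256): clip-level (scene) embedding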

Amp-Space: A Large-Scale Dataset for Fine-Grained Timbre Transformation

  • Jason Naradowsky
  • 2021 24th International Conference on Digital Audio Effects (DAFx), 2021
A large-scale dataset of paired audio samples (a source audio signal and an output signal, the result of a timbre transformation); potential use cases are shown by pre-training a conditional WaveNet model on synthetic data, which reduces the number of samples necessary to digitally reproduce a real musical device.

Synthesizer Sound Matching with Differentiable DSP

A novel approach to synthesizer sound matching is proposed: a basic subtractive synthesizer is implemented from differentiable DSP modules, with interpretable controls similar to those used in music production.
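A toy, hedged sketch of that idea (not the paper's code): the synthesizer parameters are tensors with gradients, the audio is rendered entirely from differentiable operations, and a spectral distance to the target is minimized directly. Real systems use richer synths, multi-scale spectral losses, and careful parameter scaling; pitch in particular is hard to fit with a loss this simple.

import math
import torch

sr = 16000
t = torch.arange(sr) / sr                                   # one second of time samples

def sine_synth(freq, gain):
    # A one-oscillator "synth" built only from differentiable operations.
    return gain * torch.sin(2 * math.pi * freq * t)

target = sine_synth(torch.tensor(440.0), torch.tensor(0.5))
target_mag = torch.stft(target, n_fft=512, return_complex=True).abs()

freq = torch.tensor(300.0, requires_grad=True)              # initial parameter guesses
gain = torch.tensor(0.1, requires_grad=True)
opt = torch.optim.Adam([freq, gain], lr=1.0)

for _ in range(200):
    pred_mag = torch.stft(sine_synth(freq, gain), n_fft=512, return_complex=True).abs()
    loss = (pred_mag - target_mag).pow(2).mean()            # magnitude-STFT distance
    opt.zero_grad()
    loss.backward()                                         # gradients flow through the DSP
    opt.step()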

Survival of the synthesis—GPU accelerating evolutionary sound matching

An optimized design for matching sounds generated by frequency modulation (FM) audio synthesis on the graphics processing unit (GPU) is proposed, and the relative speedup over a naive serial implementation continues to increase as the synthesis moves beyond simple FM to more advanced structures.
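A hedged illustration of why the GPU helps (parameter names and ranges are made up): an entire population of candidate FM parameter sets can be rendered, and later scored against the target, as one batched tensor operation instead of one serial render per candidate.

import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
sr, n_samples, population = 16000, 16000, 1024
t = torch.arange(n_samples, device=device) / sr             # (n_samples,)

carrier = 100 + 2000 * torch.rand(population, 1, device=device)
modulator = 1 + 500 * torch.rand(population, 1, device=device)
index = 10 * torch.rand(population, 1, device=device)

# Two-operator FM, y(t) = sin(2*pi*fc*t + I*sin(2*pi*fm*t)), broadcast over the whole population.
audio = torch.sin(2 * math.pi * carrier * t
                  + index * torch.sin(2 * math.pi * modulator * t))  # (population, n_samples)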

Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model

This study finds that the middle-layer features of existing supervised pre-trained models are more effective than the late-layer features for some tasks, and proposes a simple approach to composing features for general-purpose applications: calculating feature vectors along the time frame from middle- and late-layer outputs.
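A hedged sketch of the feature-composition step (the encoder here is a stand-in, not the pre-trained models used in the study): grab the frame-wise outputs of a middle and a late layer with forward hooks and concatenate them along the feature dimension.

import torch
import torch.nn as nn

# Stand-in "pre-trained" encoder; in practice this would be a large supervised audio model.
encoder = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128),
)

features = {}
encoder[2].register_forward_hook(lambda m, inp, out: features.update(middle=out))
encoder[4].register_forward_hook(lambda m, inp, out: features.update(late=out))

frames = torch.randn(8, 500, 64)          # (batch, time frames, input features)
_ = encoder(frames)
fused = torch.cat([features["middle"], features["late"]], dim=-1)  # (8, 500, 256)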

Neural Synthesis of Footsteps Sound Effects with Generative Adversarial Networks

This paper implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods, showing encouraging results for the task at hand.

Modeling Animal Vocalizations through Synthesizers

Lighter-weight models that incorporate structured modules and domain knowledge, notably DDSP, have been shown to produce high-quality musical sound; however, a lack of signal-processing knowledge may hinder users from effectively manipulating the synthesis parameters.

BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

It is hypothesized that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound, and a self-supervised learning method is proposed: Bootstrap Your Own Latent for Audio (BYOL-A, pronounced "viola").

TimbreCLIP: Connecting Timbre to Text and Images

Work in progress on TimbreCLIP, an audio-text cross-modal embedding trained on single instrument notes, is presented, and the application of the models is demonstrated on two tasks: text-driven audio equalization and timbre-to-image generation.

Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

This work proposes a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals using only masked patches, and sets new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2.

References

Showing 1-10 of 41 references

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets, is introduced.

Universal audio synthesizer control with normalizing flows

A novel formulation of audio synthesizer control is introduced that can simultaneously address automatic parameter inference, macro-control learning, and audio-based preset exploration within a single model, and is able to learn semantic controls of a synthesizer by smoothly mapping to its parameters.

Jukebox: A Generative Model for Music

It is shown that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes, and can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.

Amp-Space: A Large-Scale Dataset for Fine-Grained Timbre Transformation

  • Jason Naradowsky
  • 2021 24th International Conference on Digital Audio Effects (DAFx), 2021
A large-scale dataset of paired audio samples (a source audio signal and an output signal, the result of a timbre transformation); potential use cases are shown by pre-training a conditional WaveNet model on synthetic data, which reduces the number of samples necessary to digitally reproduce a real musical device.

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

By using notes as an intermediate representation, a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude is trained, a process the authors call Wave2Midi2Wave.

Neural Granular Sound Synthesis

It is demonstrated that generative neural networks can implement granular synthesis while alleviating most of its shortcomings; a major advantage of this proposal is that the resulting grain space is invertible, meaning that sound can be synthesized continuously while traversing its dimensions.

Latent Timbre Synthesis

Latent Timbre Synthesis is presented, a new audio synthesis method using deep learning that allows composers and sound designers to interpolate and extrapolate between the timbres of multiple sounds using the latent space of audio frames.

Contrastive Learning of General-Purpose Audio Representations

This work builds on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.
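A hedged sketch of the general recipe (a simplified SimCLR-style variant; the paper's similarity function and augmentation pipeline differ in detail): embeddings of two views of the same recording are pulled together, with the rest of the batch serving as negatives.

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of paired views; row i of z1 matches row i of z2.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # cosine similarities between all pairs in the batch
    labels = torch.arange(z1.shape[0])      # the i-th row's positive is the i-th column
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))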

Neural Percussive Synthesis Parameterised by High-Level Timbral Features

Using a feedforward convolutional neural network-based architecture that maps input parameters to the corresponding waveform, this approach allows for intuitive control of a synthesizer, enabling the user to shape sounds without extensive knowledge of signal processing.

Automatic Programming of VST Sound Synthesizers Using Deep Networks and Other Techniques

A bidirectional long short-term memory network with highway layers performed better than any other technique, matched sounds closely in 25% of the test cases, and provides a significant speed advantage over previously reported techniques based on search heuristics.
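A hedged, simplified sketch of such a parameter estimator (omitting the highway layers; layer sizes, input features, and the parameter count are placeholders): a bidirectional LSTM reads spectrogram frames and predicts a normalized synthesizer parameter vector for the whole sound.

import torch
import torch.nn as nn

class ParamEstimator(nn.Module):
    def __init__(self, n_bins=128, n_params=16):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 256, n_params)

    def forward(self, spec):                          # spec: (batch, time frames, freq bins)
        out, _ = self.lstm(spec)
        return torch.sigmoid(self.head(out[:, -1]))   # parameters normalized to [0, 1]

params = ParamEstimator()(torch.randn(4, 345, 128))   # (4, 16)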