Caesynth: Real-Time Timbre Interpolation and Pitch Control with Conditional Autoencoders

  • Aaron Valero Puche, Sukhan Lee
  • Published 25 October 2021
  • Computer Science
  • 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)
In this paper, we present a novel audio synthesizer, CAESynth, based on a conditional autoencoder. CAESynth synthesizes timbre in real-time by interpolating reference sounds in their shared latent feature space, while controlling pitch independently. We show that training a conditional autoencoder based on accuracy in timbre classification, together with adversarial regularization of pitch content, allows the timbre distribution in latent space to be more effective and stable for timbre…
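The core mechanic described in the abstract, interpolating two sounds' timbre embeddings in latent space while choosing the pitch via a separate conditioning label, can be sketched as follows. This is a toy illustration, not the paper's implementation: the encoder and decoder here are untrained random linear maps, and the sizes `N_MELS`, `LATENT`, and `N_PITCHES` are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (not from the paper): spectrogram frame size,
# latent timbre code size, and number of discrete pitch classes.
N_MELS, LATENT, N_PITCHES = 128, 16, 61

# Toy encoder/decoder: random linear maps standing in for trained networks.
W_enc = rng.standard_normal((LATENT, N_MELS)) * 0.1
W_dec = rng.standard_normal((N_MELS, LATENT + N_PITCHES)) * 0.1

def encode(frame):
    """Map a spectrogram frame to a (nominally pitch-free) timbre embedding."""
    return W_enc @ frame

def decode(z, pitch):
    """Decode a timbre embedding conditioned on a one-hot pitch label."""
    onehot = np.zeros(N_PITCHES)
    onehot[pitch] = 1.0
    return W_dec @ np.concatenate([z, onehot])

# Two reference sounds and an interpolation weight alpha in [0, 1].
frame_a = rng.random(N_MELS)
frame_b = rng.random(N_MELS)
alpha = 0.3

# Interpolate timbre in latent space, then pick the pitch independently:
# the same mixed timbre can be rendered at any of the N_PITCHES pitches.
z_mix = (1 - alpha) * encode(frame_a) + alpha * encode(frame_b)
out = decode(z_mix, pitch=40)
print(out.shape)  # (128,)
```

In the paper, making this decomposition work is the hard part: a timbre-classification loss encourages the latent space to organize by timbre, and adversarial regularization discourages the encoder from leaking pitch into the timbre code, so that the pitch label alone controls pitch.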

TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

TimbreTron, a method for musical timbre transfer which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer, is introduced.

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets, is introduced.

Conditioning Autoencoder Latent Spaces for Real-Time Timbre Interpolation and Synthesis

This work proposes a one-hot encoded chroma feature vector for use in both input augmentation and latent-space conditioning, measures the performance of the resulting networks, and characterizes the latent embeddings that arise from the use of this chroma conditioning vector.

GANSynth: Adversarial Neural Audio Synthesis

Through extensive empirical investigations on the NSynth dataset, it is demonstrated that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.

WaveNet: A Generative Model for Raw Audio

WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

Waveglow: A Flow-based Generative Network for Speech Synthesis

WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.

CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

The applicability of CycleGAN-VC/VC2 to mel-spectrogram conversion was examined and it was discovered that their direct applications compromised the time-frequency structure that should be preserved during conversion.

Audio Style Transfer

A flexible framework for the task, which uses a sound texture model to extract statistics characterizing the reference audio style, followed by an optimization-based audio texture synthesis to modify the target content, is proposed.

Improved Techniques for Training GANs

This work focuses on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic; it presents ImageNet samples with unprecedented resolution and shows that the methods enable the model to learn recognizable features of ImageNet classes.

FSD50K: An Open Dataset of Human-Labeled Sound Events

FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio, manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.