
Bridging Audio Analysis, Perception and Synthesis with Perceptually-regularized Variational Timbre Spaces

Philippe Esling, Axel Chemla-Romeu-Santos, Adrien Bitton
Generative models aim to understand the properties of data through the construction of latent spaces that allow classification and generation. However, as the learning is unsupervised, the latent dimensions are not related to perceptual properties. In parallel, music perception research has aimed to understand timbre based on human dissimilarity ratings. These ratings lead to timbre spaces which exhibit perceptual similarities between sounds. However, such spaces do not generalize to novel examples and do…
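A minimal sketch of how a perceptual regularizer might be combined with the usual VAE KL term, as the title suggests. The function names and the exact penalty form (matching pairwise latent distances to human dissimilarity ratings) are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def kl_gaussian(mu, logvar):
    # KL divergence KL(N(mu, sigma^2) || N(0, I)), one value per sample
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

def perceptual_regularizer(mu, dissim):
    # Hypothetical regularizer: penalize mismatch between pairwise
    # distances in the latent space and human dissimilarity ratings.
    n = mu.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_latent = np.linalg.norm(mu[i] - mu[j])
            total += (d_latent - dissim[i, j]) ** 2
            pairs += 1
    return total / pairs
```

In a sketch like this, the regularizer would be added to the ELBO with some weight, pulling latent geometry toward the perceptual timbre space.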
Timbre latent space: exploration and creative aspects
The following experiments were conducted in cooperation with two composers and propose new creative directions for exploring latent sound synthesis of musical timbres, using specifically designed interfaces (Max/MSP, Pure Data) or mappings for descriptor-based synthesis.
Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders
In-depth evaluation confirms the model's ability to successfully disentangle timbre and pitch, and enables timbre transfer between multiple instruments with a single autoencoder architecture.
Pitch-Timbre Disentanglement Of Musical Instrument Sounds Based On Vae-Based Metric Learning
Experimental results show that the proposed representation learning method can find better-structured disentangled representations with pitch and timbre clusters even for unseen musical instruments.
DrumGAN: Synthesis of Drum Sounds With Timbral Feature Conditioning Using Generative Adversarial Networks
A Generative Adversarial Network is applied to the task of audio synthesis of drum sounds, and it is shown that the approach considerably improves the quality of the generated drum samples, and that the conditional input indeed shapes the perceptual characteristics of the sounds.
GANSynth: Adversarial Neural Audio Synthesis
Through extensive empirical investigations on the NSynth dataset, it is demonstrated that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
Notes on the use of variational autoencoders for speech and audio spectrogram modeling
This paper shows that a sound theoretical statistical framework exists and has been extensively presented and discussed in work on nonnegative matrix factorization of audio spectrograms and its application to audio source separation, and it provides insights on the choice and interpretability of data representation and model parameterization.
DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs
This work performs knowledge distillation from a large audio tagging system into an adversarial audio synthesizer called DarkGAN, and shows that DarkGAN can synthesize musical audio with acceptable quality and exhibits moderate attribute control even with out-of-distribution input conditioning.
Deep Music Analogy Via Latent Representation Disentanglement
An explicitly-constrained variational autoencoder (EC$^2$-VAE) is contributed as a unified solution to all three sub-problems of disentangling music representations, and is validated using objective measurements and evaluated by a subjective study.
Recent advancements in generative audio synthesis have allowed for the development of creative tools for generation and manipulation of audio. In this paper, a strategy is proposed for the synthesis…
Visualization-based disentanglement of latent space
  • Runze Huang, Qianying Zheng, Haifang Zhou
  • Computer Science
  • Neural Computing and Applications
  • 2021
A novel method is proposed that uses the encoder–decoder architecture to disentangle data into two visualizable representations encoded as latent spaces; it can be used to manipulate a wide range of data attributes and to generate realistic music via analogy.


Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale, high-quality dataset of musical notes an order of magnitude larger than comparable public datasets, is introduced.
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
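The key to modeling tens of thousands of samples per second is WaveNet's stack of dilated causal convolutions, whose receptive field grows exponentially with depth while compute grows only linearly. A small sketch of that arithmetic (a single stack with dilation doubling per layer, as in the paper; the helper function name is an assumption):

```python
def receptive_field(n_layers, kernel_size=2):
    # Dilation doubles at each layer (1, 2, 4, ...), so each layer adds
    # (kernel_size - 1) * 2**i samples of context; +1 for the current sample.
    return sum((kernel_size - 1) * 2**i for i in range(n_layers)) + 1
```

For example, a 10-layer stack with kernel size 2 already covers 1024 samples of context.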
A Meta-analysis of Timbre Perception Using Nonlinear Extensions to CLASCAL
Isomap is designed to eliminate undesirable nonlinearities in the input data in order to reduce the overall dimensionality, and succeeds in these goals for timbre spaces, compressing the output onto well-known dimensions of timbre and highlighting the challenges inherent in quantifying differences in spectral shape.
beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
Learning an interpretable factorised representation of the independent data generative factors of the world without supervision is an important precursor for the development of artificial…
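The beta-VAE objective is a one-line modification of the standard ELBO: the KL term is weighted by a factor beta > 1 to encourage a factorised latent code. A minimal numpy sketch (the squared-error reconstruction term is an illustrative choice; the paper uses a likelihood appropriate to the data):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    # Reconstruction term (squared error summed over features)
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # KL divergence of the approximate posterior from the N(0, I) prior
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    # beta > 1 upweights the KL term, pressuring latent dimensions
    # toward independent (disentangled) factors
    return float(np.mean(recon + beta * kl))
```

Setting beta = 1 recovers the standard VAE objective; larger beta trades reconstruction fidelity for disentanglement.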
Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes
The model with latent classes and specificities gave a better fit to the data and made the acoustic correlates of the common dimensions more interpretable, suggesting that musical timbres possess specific attributes not accounted for by these shared perceptual dimensions.
Learning Latent Representations for Speech Generation and Transformation
The capability of the convolutional VAE model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data, is demonstrated.
Multidimensional perceptual scaling of musical timbres.
  • J. Grey
  • Mathematics, Medicine
  • The Journal of the Acoustical Society of America
  • 1977
Two experiments were performed to evaluate the perceptual relationships between 16 music instrument tones, and a three‐dimensional scaling solution was found to be interpretable in terms of the spectral energy distribution.
Perceptual effects of spectral modifications on musical timbres
An experiment was performed to evaluate the effects of spectral modifications on the similarity structure for a set of musical timbres. The stimuli were 16 music instrument tones, 8 of which were…
A meta‐analysis of acoustic correlates of timbre dimensions
A meta‐analysis of ten published timbre spaces was conducted using multidimensional scaling analyses (CLASCAL) of dissimilarity ratings on recorded, resynthesized, or synthesized musical instrument…
Isolating the dynamic attributes of musical timbre.
The results indicate that the dynamic attributes of timbre are not only present at the onset, but also throughout, and that multiple acoustic attributes may contribute to the same perceptual dimensions.