Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders
@article{Luo2019LearningDR,
  title={Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders},
  author={Yin-Jyun Luo and Kat R. Agres and Dorien Herremans},
  journal={ArXiv},
  year={2019},
  volume={abs/1906.08152}
}
In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, which form Gaussian mixture components representing instrument identity and pitch, respectively. For reconstruction, latent variables of timbre and pitch are sampled from corresponding mixture…
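The sampling step described in the abstract can be illustrated with a minimal sketch. It assumes diagonal Gaussian components and uses made-up class names, dimensions, and parameter values; the helpers `sample_gaussian` and `sample_latents` are hypothetical and are not taken from the paper. It shows only how a timbre latent and a pitch latent might be drawn from their own mixture components, not how the decoder uses them.

```python
import math
import random

# Hypothetical mixture parameters: one diagonal Gaussian per class.
# Class names, dimensions, and values are illustrative, not from the paper.
TIMBRE_COMPONENTS = {
    "flute": ([1.0, -1.0], [-2.0, -2.0]),    # (means, log-variances)
    "violin": ([-1.0, 1.0], [-2.0, -2.0]),
}
PITCH_COMPONENTS = {
    "C4": ([0.5, 0.5, 0.0], [-3.0, -3.0, -3.0]),
    "A4": ([-0.5, 0.0, 0.5], [-3.0, -3.0, -3.0]),
}


def sample_gaussian(means, log_vars, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) for a diagonal Gaussian."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(means, log_vars)]


def sample_latents(instrument, pitch, rng):
    """Sample one timbre latent and one pitch latent from separate mixtures."""
    z_timbre = sample_gaussian(*TIMBRE_COMPONENTS[instrument], rng)
    z_pitch = sample_gaussian(*PITCH_COMPONENTS[pitch], rng)
    return z_timbre, z_pitch


rng = random.Random(0)
z_timbre, z_pitch = sample_latents("flute", "A4", rng)
```

Because the two latents come from independent components, swapping the instrument key while keeping the pitch key changes only `z_timbre`, which is the intuition behind disentangled control in this kind of model.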
31 Citations
Unsupervised Disentanglement of Pitch and Timbre for Isolated Musical Instrument Sounds
- Computer Science, ISMIR
- 2020
A framework that achieves unsupervised pitch and timbre disentanglement for isolated musical instrument sounds without relying on data annotations or pre-trained neural networks is proposed, based on variational auto-encoders.
Pitch-Timbre Disentanglement of Musical Instrument Sounds Based on VAE-Based Metric Learning
- Computer Science, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
Experimental results show that the proposed representation learning method can find better-structured disentangled representations with pitch and timbre clusters even for unseen musical instruments.
Hyperbolic Timbre Embedding for Musical Instrument Sound Synthesis Based on Variational Autoencoders
- Computer Science, 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- 2022
This paper proposes a musical instrument sound synthesis (MISS) method based on a variational autoencoder with a hierarchy-inducing latent space for timbre, which can represent treelike data more efficiently than Euclidean space owing to its exponential growth property.
Unsupervised Disentanglement of Timbral, Pitch, and Variation Features From Musical Instrument Sounds With Random Perturbation
- Computer Science, 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- 2022
The proposed unsupervised disentanglement method for musical instrument sounds with pitched and unpitched spectra can provide effective timbral and pitch features for better musical instrument classification and pitch estimation.
Vector-Quantized Timbre Representation
- Computer Science, ArXiv
- 2020
This paper introduces an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution, and targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
Drum Synthesis and Rhythmic Transformation with Adversarial Autoencoders
- Computer Science, ACM Multimedia
- 2020
This paper presents a method for joint synthesis and rhythm transformation of drum sounds through the use of adversarial autoencoders (AAE) to navigate both the timbre and rhythm of drum patterns in audio recordings through expressive control over a low-dimensional latent space.
Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work proposes a flexible framework that handles both singer conversion and conversion of a singer's vocal technique, and is the first work to jointly tackle conversion of singer identity and vocal technique with a deep learning approach.
CAESynth: Real-Time Timbre Interpolation and Pitch Control with Conditional Autoencoders
- Computer Science, 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)
- 2021
It is demonstrated by experiments that CAESynth achieves smooth and high-fidelity audio synthesis in real-time through timbre interpolation and independent yet accurate pitch control for musical cues as well as for audio affordance with environmental sound.
Signal Representations for Synthesizing Audio Textures with Generative Adversarial Networks
- Computer Science, ArXiv
- 2021
This paper proposes that training GANs on single-channel magnitude spectra and inverting them with the Phase Gradient Heap Integration (PGHI) algorithm is a better overall approach for audio synthesis modeling of diverse signals that include pitched, non-pitched, and dynamically complex sounds.
Timbre Classification of Musical Instruments with a Deep Learning Multi-Head Attention-Based Model
- Computer Science, ArXiv
- 2021
A deep learning model that identifies different instrument timbres with as few parameters as possible is defined, making it possible to assess the ability of the proposed architecture to distinguish timbre and to establish the aspects on which future work should focus.
36 References
Learning Disentangled Representations for Timber and Pitch in Music Audio
- Computer Science, ArXiv
- 2018
This paper proposes two deep convolutional neural network models for learning disentangled representations of musical timbre and pitch, and shows that the second model can better change the instrumentation of a multi-instrument music piece without much affecting the pitch structure.
Generative timbre spaces with variational audio synthesis
- Computer Science, ArXiv
- 2018
This work adapts VAEs to create a generative latent space, while using perceptual ratings from timbre studies to regularize the organization of this space, and introduces a method for descriptor-based synthesis and shows that it can control the descriptors of an instrument while keeping its timbre structure.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
- Computer Science, ICML
- 2017
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets is introduced.
Bridging Audio Analysis, Perception and Synthesis with Perceptually-regularized Variational Timbre Spaces
- Computer Science, ISMIR
- 2018
It is shown that Variational Auto-Encoders (VAE) can bridge these lines of research and alleviate their weaknesses by regularizing the latent spaces to match perceptual distances collected from timbre studies; three types of regularization are proposed, and the resulting spaces are shown to be usable for efficient audio classification.
Modulated Variational auto-Encoders for many-to-many musical timbre transfer
- Computer Science, ArXiv
- 2018
This paper introduces the Modulated Variational auto-Encoders (MoVE) to perform musical timbre transfer, and shows that this architecture allows for generative controls in multi-domain transfer, yet remaining light, fast to train and effective on small datasets.
MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer
- Computer Science, ISMIR
- 2018
We introduce MIDI-VAE, a neural network model based on Variational Autoencoders that is capable of handling polyphonic music with multiple instrument tracks, as well as modeling the dynamics of music…
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
- Computer Science, ICLR
- 2019
TimbreTron, a method for musical timbre transfer which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer, is introduced.
WaveNet: A Generative Model for Raw Audio
- Computer Science, SSW
- 2016
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Hierarchical Generative Modeling for Controllable Speech Synthesis
- Computer Science, ICLR
- 2019
A high-quality controllable TTS model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions is proposed.
Towards Timbre-Invariant Audio Features for Harmony-Based Music
- Computer Science, IEEE Transactions on Audio, Speech, and Language Processing
- 2010
A novel procedure is described that further enhances chroma features by significantly boosting the degree of timbre invariance without degrading the features' discriminative power, revealing the musical meaning of certain pitch-frequency cepstral coefficients.