Corpus ID: 227737022

I'm Sorry for Your Loss: Spectrally-Based Audio Distances Are Bad at Pitch

Joseph P. Turian, Max Henry
Growing research demonstrates that synthetic failure modes imply poor generalization. We compare commonly used audio-to-audio losses on a synthetic benchmark, measuring the pitch distance between two stationary sinusoids. The results are surprising: many have a poor sense of pitch direction. These shortcomings are exposed using simple rank assumptions. Our task is trivial for humans but difficult for these audio distances, suggesting significant progress can be made in self-supervised audio…
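The benchmark described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the choice of an L1 log-magnitude STFT distance, and all parameters (sample rate, FFT size, test frequencies) are assumptions. The idea is the rank probe: if a spectral distance has a correct sense of pitch, its value against a fixed reference sinusoid should increase monotonically as the comparison pitch moves away.

```python
import numpy as np

def sine(freq_hz, sr=16000, dur=0.5):
    """Generate a stationary sinusoid (assumed benchmark stimulus)."""
    t = np.arange(int(sr * dur)) / sr
    return np.sin(2 * np.pi * freq_hz * t)

def log_stft_l1(x, y, n_fft=1024, hop=256):
    """L1 distance between log-magnitude STFTs, one common spectral loss."""
    def stft_mag(sig):
        # Frame the signal, window it, and take the magnitude spectrum.
        frames = np.lib.stride_tricks.sliding_window_view(sig, n_fft)[::hop]
        win = np.hanning(n_fft)
        return np.abs(np.fft.rfft(frames * win, axis=-1))
    eps = 1e-7  # avoid log(0)
    return float(np.mean(np.abs(np.log(stft_mag(x) + eps)
                                - np.log(stft_mag(y) + eps))))

# Rank probe against a 440 Hz reference: increasing pitch offsets.
ref = sine(440.0)
dists = [log_stft_l1(ref, sine(f)) for f in (445.0, 460.0, 490.0, 550.0)]
print(dists)
```

A distance with a good sense of pitch direction would return a monotonically increasing list here; the paper's finding is that many spectral losses fail such rank checks.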


Perceptual Loss with Recognition Model for Single-Channel Enhancement and Robust ASR
A pre-trained acoustic model is used to generate a perceptual loss that makes speech enhancement more aware of the phonetic properties of the signal, which keeps some benefits of joint training while alleviating the overfitting problem.
One Billion Audio Sounds from GPU-Enabled Modular Synthesis
A multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, which is 100x larger than any audio dataset in the literature, and proposes novel approaches to synthesizer hyperparameter optimization.
Towards Lightweight Controllable Audio Synthesis with Conditional Implicit Neural Representations
This work aims to shed light on the potential of Conditional Implicit Neural Representations (CINRs) as lightweight backbones in generative frameworks for audio synthesis.


Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets is introduced.
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Singing Voice Separation with Deep U-Net Convolutional Networks
This work proposes a novel application of the U-Net architecture — initially developed for medical imaging — for the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction.
SING: Symbol-to-Instrument Neural Generator
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
Crepe: A Convolutional Representation for Pitch Estimation
This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.
Self-supervised Pitch Detection by Inverse Audio Synthesis
It is demonstrated that DDSP modules can enable a new approach to self-supervision, generating synthetic audio with differentiable synthesizers and training feature extractor networks to infer the synthesis parameters.
Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network
A convolutional DNN of around a billion parameters is trained to provide probabilistic estimates of the ideal binary mask for separating vocal sounds from real-world musical mixtures, and may be useful for automatic removal of vocal sounds from musical mixtures for 'karaoke' type applications.
Speech Denoising with Deep Feature Losses
An end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly, which outperforms the state-of-the-art in objective speech quality metrics and in large-scale perceptual experiments with human listeners.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion, and suggests a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks.
DDSP: Differentiable Digital Signal Processing
The Differentiable Digital Signal Processing library is introduced, which enables direct integration of classic signal processing elements with deep learning methods and achieves high-fidelity generation without the need for large autoregressive models or adversarial losses.