Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms

@inproceedings{Kilgour2019FrchetAD,
  title={Fr{\'e}chet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms},
  author={Kevin Kilgour and Mauricio Zuluaga and Dominik Roblek and Matthew Sharifi},
  booktitle={INTERSPEECH},
  year={2019}
}
We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. FAD is validated using a wide variety of artificial distortions and is compared to the signal-based metrics signal-to-distortion ratio (SDR), cosine distance, and magnitude L2 distance. We show that, with a correlation coefficient of 0.52, FAD correlates more closely with human perception than SDR, cosine distance, or magnitude L2 distance, with correlation coefficients…
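Concretely, FAD fits a multivariate Gaussian to embedding statistics of a reference set and of the enhanced (or generated) audio, then takes the closed-form Fréchet distance between the two Gaussians. The paper extracts embeddings with a VGGish model; that step is omitted below. A minimal NumPy sketch of the closed-form distance (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def frechet_audio_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Squared Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g).

    Uses the identity tr(sqrtm(S_r S_g)) = sum of sqrt eigenvalues of
    sqrt(S_r) S_g sqrt(S_r), which is symmetric positive semi-definite,
    so plain eigendecomposition suffices (no scipy.linalg.sqrtm needed).
    """
    vals_r, vecs_r = np.linalg.eigh(sigma_r)
    sqrt_sigma_r = vecs_r @ np.diag(np.sqrt(np.clip(vals_r, 0.0, None))) @ vecs_r.T
    inner = sqrt_sigma_r @ sigma_g @ sqrt_sigma_r
    tr_sqrt = np.sum(np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0.0, None)))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

def embedding_stats(embeddings):
    """Mean and covariance of an (n_frames, dim) embedding matrix."""
    return embeddings.mean(axis=0), np.cov(embeddings, rowvar=False)
```

Given two embedding matrices `real` and `enhanced`, the score would be `frechet_audio_distance(*embedding_stats(real), *embedding_stats(enhanced))`; lower values indicate statistics closer to the reference distribution.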


Audio Inpainting based on Self-similarity for Sound Source Separation Applications
TLDR
A simple algorithm for audio inpainting based on self-similarity within the signal, which uses the non-zero bin values observed in similar frames (past or future) as substitutes for the zero bin values in the current analysis frame.
A Study on Robustness to Perturbations for Representations of Environmental Sound
TLDR
This work extends HEAR to evaluate invariance to channel effects, imitating channel effects by injecting perturbations into the audio signal and measuring the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent.
BEHM-GAN: Bandwidth Extension of Historical Music using Generative Adversarial Networks
TLDR
The results of a formal blind listening test show that BEHM-GAN increases the perceptual sound quality in early-20th-century gramophone recordings and represents a relevant step toward data-driven music restoration in real-world scenarios.
Removing Distortion Effects in Music Using Deep Neural Networks
TLDR
This paper focuses on removing distortion and clipping applied to guitar tracks for music production while presenting a comparative investigation of different deep neural network (DNN) architectures on this task, achieving exceptionally good results in distortion removal using DNNs.
Perceiving Music Quality with GANs
TLDR
This work proposes training a generative adversarial network on a music library and using its discriminator as a measure of the perceived quality of music, showing a statistically significant correlation with human ratings.
Learning to Denoise Historical Music
TLDR
An audio-to-audio neural network model is proposed that learns to denoise old music recordings by computing a short-time Fourier transform and processing the resulting complex spectrogram with a convolutional neural network.
StyleWaveGAN: Style-based synthesis of drum sounds with extensive controls using generative adversarial networks
TLDR
By conditioning StyleWaveGAN on both the type of drum and several audio descriptors, it is able to synthesize waveforms faster than real-time on a GPU directly in CD quality up to a duration of 1.5s while retaining a considerable amount of control over the generation.
Automatic Quality Assessment of Digitized and Restored Sound Archives
TLDR
A framework to assess the quality of experience (QoE) of sound archives in an automatic fashion is presented, along with the reasons why stakeholders, such as archivists, broadcasters, or public listeners, would benefit from the proposed framework.
A Proposal for Foley Sound Synthesis Challenge
TLDR
This work proposes a challenge for automatic Foley sound synthesis and outlines its details and design considerations, including task definition, dataset requirements, and evaluation criteria.
Conditioned Source Separation for Musical Instrument Performances
TLDR
This paper proposes a source separation method for multiple musical instruments sounding simultaneously and explores how much additional information apart from the audio stream can lift the quality of source separation.

References

SHOWING 1-10 OF 13 REFERENCES
Performance measurement in blind audio source separation
TLDR
This paper considers four different sets of allowed distortions in blind audio source separation algorithms, from time-invariant gains to time-varying filters, and derives a global performance measure using an energy ratio, plus a separate performance measure for each error term.
SDR – Half-baked or Well Done?
TLDR
It is argued here that the signal-to-distortion ratio (SDR) implemented in the BSS_eval toolkit has generally been improperly used and abused, especially in the case of single-channel separation, resulting in misleading results.
A short-time objective intelligibility measure for time-frequency weighted noisy speech
TLDR
An objective intelligibility measure is presented, which shows high correlation (rho=0.95) with the intelligibility of both noisy, and TF-weighted noisy speech, and shows significantly better performance than three other, more sophisticated, objective measures.
Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs
TLDR
A new model has been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay, known as perceptual evaluation of speech quality (PESQ).
Singing Voice Separation with Deep U-Net Convolutional Networks
TLDR
This work proposes a novel application of the U-Net architecture — initially developed for medical imaging — for the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction.
Music Source Separation Using Stacked Hourglass Networks
TLDR
Experimental results on MIR-1K and DSD100 datasets validate that the proposed method achieves competitive results comparable to the state-of-the-art methods in multiple music source separation and singing voice separation tasks.
Towards Accurate Generative Models of Video: A New Metric & Challenges
TLDR
A large-scale human study is contributed, which confirms that FVD correlates well with qualitative human judgment of generated videos, and provides initial benchmark results on SCV.
Supervised Speech Separation Based on Deep Learning: An Overview
TLDR
This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years, and provides a historical perspective on how advances are made.
MIR_EVAL: A Transparent Implementation of Common MIR Metrics
Central to the field of MIR research is the evaluation of algorithms used to extract information from music data. We present mir_eval, an open source software library which provides a transparent and…
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
TLDR
This work proposes a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions and introduces the "Frechet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score.