Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms
TLDR
FAD is validated using a wide variety of artificial distortions and compared to the signal-based metrics signal-to-distortion ratio (SDR), cosine distance and magnitude L2 distance. With a correlation coefficient of 0.52, FAD correlates more closely with human perception than SDR, cosine distance or magnitude L2 distance.
Self-supervised audio representation learning for mobile devices
TLDR
The quality of the embeddings produced by the self-supervised learning models is evaluated, and it is shown that they can be re-used for a variety of downstream tasks and, for some tasks, even approach the performance of fully supervised models of similar size.
Pre-Training Audio Representations With Self-Supervision
TLDR
This work proposes two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip.
Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms
TLDR
The Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms, is proposed and it is shown that, with a correlation coefficient of 0.52, FAD correlates more closely with human perception than either SDR, cosine distance or magnitude L2 distance.
From Here to There: Video Inbetweening Using Direct 3D Convolutions
TLDR
A fully convolutional model is proposed that generates video sequences directly in the pixel domain by obtaining a latent video representation using a stochastic fusion mechanism that learns how to incorporate information from the start and end frames.
One-Shot Conditional Audio Filtering of Arbitrary Sounds
We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a
Real-Time Speech Frequency Bandwidth Extension
TLDR
A lightweight model for frequency bandwidth extension of speech signals is presented. It increases the sampling frequency from 8 kHz to 16 kHz while restoring the high-frequency content to a level almost indistinguishable from the 16 kHz ground truth, with an architectural latency of 16 ms.
Training Keyword Spotters with Limited and Synthesized Speech Data
TLDR
This paper uses a pre-trained speech embedding model to extract useful features for keyword spotting models, and shows that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples.
Extending Siena to support more expressive and flexible subscriptions
This paper defines and discusses the implementation of two novel extensions to the Siena Content-based Network (CBN) to extend it to become a Knowledge-based Network (KBN) thereby increasing the
Now Playing: Continuous low-power music recognition
TLDR
A low-power music recognizer is presented that runs entirely on a mobile device, which respects user privacy, and passively recognizes a wide range of music without user interaction.
...