One-Shot Conditional Audio Filtering of Arbitrary Sounds

@inproceedings{gfeller2021oneshot,
  title={One-Shot Conditional Audio Filtering of Arbitrary Sounds},
  author={Beat Gfeller and Dominik Roblek and Marco Tagliasacchi},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021}
}
We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a wave-to-wave neural network architecture, we can train a model without using any sound class labels. Using a conditioning encoder model that is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen…
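To make the setup concrete, here is a minimal toy sketch of the one-shot conditioning idea: an encoder turns a short reference clip into an embedding, and a separator uses that embedding to filter the mixture. This is not SoundFilter's actual wave-to-wave architecture; the spectral-weighting "separator" and the mean-spectrum "encoder" below are illustrative stand-ins, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_reference(reference):
    # Toy conditioning encoder: embedding = mean magnitude spectrum
    # of the short reference clip, framed into 64-sample windows.
    spec = np.abs(np.fft.rfft(reference.reshape(-1, 64), axis=1))
    return spec.mean(axis=0)

def separate(mixture, embedding):
    # Toy separator: weight each frequency bin of the mixture by how
    # strongly it appears in the reference embedding, then resynthesize.
    frames = np.fft.rfft(mixture.reshape(-1, 64), axis=1)
    weights = embedding / (embedding.max() + 1e-8)
    return np.fft.irfft(frames * weights, n=64, axis=1).reshape(-1)

reference = rng.standard_normal(640)   # short sample of the target source
mixture = rng.standard_normal(1280)    # single-channel mixture
estimate = separate(mixture, encode_reference(reference))
assert estimate.shape == mixture.shape
```

In the paper, both networks are trained jointly end-to-end, so the encoder learns whatever embedding best helps the separator; the sketch only shows the data flow at inference time.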


AvaTr: One-Shot Speaker Extraction with Transformers
Two models that incorporate voice characteristics into a Transformer, based on different insights about where feature selection should take place, yield excellent performance, on par with or better than published state-of-the-art models on the speaker extraction task, including separating speech of novel speakers not seen during training.
Few-shot learning of new sound classes for target sound extraction
This work proposes combining one-hot- and enrollment-based target sound extraction, allowing optimal performance for seen AE classes and simple extension to new classes, and proposes adapting the embedding vectors obtained from a few enrollment audio samples to further improve performance on new classes.
Multistage linguistic conditioning of convolutional layers for speech emotion recognition
In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage…


FSD50K: an Open Dataset of Human-Labeled Sound Events
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster SER research.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Librispeech: An ASR corpus based on public domain audio books
It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations
Temporal Feature-Wise Linear Modulation (TFiLM) is proposed, a novel architectural component inspired by adaptive batch normalization and its extensions that uses a recurrent neural network to alter the activations of a convolutional model.
FiLM: Visual Reasoning with a General Conditioning Layer
It is shown that FiLM layers are highly effective for visual reasoning (answering image-related questions that require a multi-step, high-level process), a task that has proven difficult for standard deep learning methods that do not explicitly model reasoning.
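The FiLM mechanism referenced above is simple to state: each feature channel is scaled and shifted by parameters predicted from the conditioning input. Below is a minimal numpy sketch of that operation; the linear conditioning network and all array shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def film(features, gamma, beta):
    # Feature-wise Linear Modulation: per-channel affine transform.
    # features: (batch, channels, length); gamma, beta: (batch, channels)
    return gamma[:, :, None] * features + beta[:, :, None]

rng = np.random.default_rng(0)
channels, embed_dim = 4, 8

# Hypothetical toy conditioning network: one linear map from a
# conditioning embedding to the per-channel (gamma, beta) pairs.
W = rng.standard_normal((2 * channels, embed_dim))

def condition(embedding):
    params = W @ embedding                 # (2 * channels,)
    return params[:channels], params[channels:]

x = rng.standard_normal((1, channels, 16))   # conv feature map
z = rng.standard_normal(embed_dim)           # conditioning vector
gamma, beta = condition(z)
y = film(x, gamma[None], beta[None])
assert y.shape == x.shape
```

With `gamma = 1` and `beta = 0` the layer is the identity, which is one reason FiLM is easy to drop into an existing convolutional model; TFiLM (above) replaces the static conditioning network with a recurrent one so the modulation can vary over time.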
Conditioned Source Separation for Musical Instrument Performances
This paper proposes a source separation method for multiple musical instruments sounding simultaneously and explores how much additional information apart from the audio stream can lift the quality of source separation.
Wavesplit: End-to-End Speech Separation by Speaker Clustering
Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.
A Spectral Energy Distance for Parallel Speech Synthesis
This work proposes a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function, based on a generalized energy distance between the distributions of the generated and real audio.
DDSP: Differentiable Digital Signal Processing
The Differentiable Digital Signal Processing library is introduced, which enables direct integration of classic signal processing elements with deep learning methods and achieves high-fidelity generation without the need for large autoregressive models or adversarial losses.
Finding Strength in Weakness: Learning to Separate Sounds With Weak Supervision
This work proposes objective functions and network architectures that enable training a source separation system with weak labels and benchmarks the performance of the algorithm using synthetic mixtures of overlapping events created from a database of sounds recorded in urban environments.