One-Shot Conditional Audio Filtering of Arbitrary Sounds

@inproceedings{gfeller2021oneshot,
  title={One-Shot Conditional Audio Filtering of Arbitrary Sounds},
  author={Beat Gfeller and Dominik Roblek and Marco Tagliasacchi},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021}
}
We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a wave-to-wave neural network architecture, we can train a model without using any sound class labels. Using a conditioning encoder model that is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen…
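As an illustrative sketch only (the names, weights, and shapes below are assumptions for exposition, not the paper's architecture), the idea is that a jointly learned conditioning encoder maps a short reference clip to an embedding, which then "configures" the wave-to-wave filter, e.g. via a FiLM-style scale and shift:

```python
import numpy as np

rng = np.random.default_rng(0)
frame, dim = 16, 8                      # hypothetical frame size / embedding dim
enc_w = rng.normal(size=(frame, dim))   # toy encoder weights (learned jointly)
gam_w = rng.normal(size=(dim,))         # embedding -> filter scale
bet_w = rng.normal(size=(dim,))         # embedding -> filter shift

def conditioning_encoder(reference):
    """Mean-pool framed projections of the reference clip to one embedding."""
    frames = reference.reshape(-1, frame)
    return np.tanh(frames @ enc_w).mean(axis=0)

def sound_filter(mixture, embedding):
    """Toy wave-to-wave 'filter' configured by the embedding (FiLM-style)."""
    gamma, beta = embedding @ gam_w, embedding @ bet_w
    return gamma * mixture + beta

ref = rng.normal(size=64)               # short sample of the target source
mix = rng.normal(size=256)              # single-channel mixture
out = sound_filter(mix, conditioning_encoder(ref))
```

Because the embedding, not a class label, selects the target, the same trained filter can in principle be pointed at sources never seen during training.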


Text-Driven Separation of Arbitrary Sounds
This work proposes a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source, by combining two distinct models that are agnostic to the conditioning modality.
Few-Shot Musical Source Separation
It is shown that the proposed few-shot conditioning paradigm outperforms the baseline one-hot instrument-class conditioned model for both seen and unseen instruments.
SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning
This paper introduces a TSE framework, SoundBeam, that combines the advantages of both sound-class-label-based and enrollment-based approaches, and performs an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.
AvaTr: One-Shot Speaker Extraction with Transformers
Two Transformer-based models that incorporate voice characteristics, based on different insights about where feature selection should take place, yield excellent performance, on par with or better than published state-of-the-art models on the speaker extraction task, including separating the speech of novel speakers not seen during training.
Improving Target Sound Extraction with Timestamp Information
A mutual learning framework of target sound detection and extraction is proposed, and experimental results on synthesized data generated from the Freesound Datasets show that the proposed method can significantly improve the performance of TSE.
Do You See What I See? Capabilities and Limits of Automated Multimedia Content Analysis
The capabilities and limitations of tools for analyzing online multimedia content are explained and the potential risks of using these tools at scale without accounting for their limitations are highlighted.
Multistage linguistic conditioning of convolutional layers for speech emotion recognition
This work proposes a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrasts it with a single-stage one where the streams are merged in a single point.
Few-shot learning of new sound classes for target sound extraction
This work proposes combining one-hot- and enrollment-based target sound extraction, allowing optimal performance for seen AE classes and simple extension to new classes, and proposes adapting the embedding vectors obtained from a few enrollment audio samples to further improve performance on new classes.
Speech enhancement with weakly labelled data from AudioSet
This paper proposes a speech enhancement framework trained on weakly labelled data that achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system.


FSD50K: An Open Dataset of Human-Labeled Sound Events
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Librispeech: An ASR corpus based on public domain audio books
It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations
Temporal Feature-Wise Linear Modulation (TFiLM) is proposed, a novel architectural component inspired by adaptive batch normalization and its extensions that uses a recurrent neural network to alter the activations of a convolutional model.
FiLM: Visual Reasoning with a General Conditioning Layer
It is shown that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning.
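At its core, a FiLM layer is just a per-channel affine transform whose scale and shift come from a conditioning input; the following minimal sketch (shapes chosen for illustration, not from the paper) shows the operation:

```python
import numpy as np

def film(features, gamma, beta):
    """FiLM: feature-wise linear modulation of a feature map.
    features: (channels, time); gamma, beta: (channels,) produced by
    some conditioning network (here supplied by hand)."""
    return gamma[:, None] * features + beta[:, None]

x = np.ones((2, 4))               # toy feature map
gamma = np.array([2.0, 0.5])      # conditioning-derived scales
beta = np.array([1.0, -1.0])      # conditioning-derived shifts
y = film(x, gamma, beta)          # channel 0 -> 3.0, channel 1 -> -0.5
```

The appeal is that the conditioning signal never has to match the feature map's spatial or temporal layout; it only supplies one scale and one shift per channel.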
SEANet: A Multi-modal Speech Enhancement Network
This work explores the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions by feeding a multi-modal input to SEANet, a wave-to-wave fully convolutional model that adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the user's speech.
Learning to Denoise Historical Music
An audio-to-audio neural network model that learns to denoise old music recordings by means of a short-time Fourier transform and processes the resulting complex spectrogram using a convolutional neural network is proposed.
A Spectral Energy Distance for Parallel Speech Synthesis
This work proposes a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function, based on a generalized energy distance between the distributions of the generated and real audio.
Unsupervised Sound Separation Using Mixtures of Mixtures
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
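As a hedged sketch of the MixIT idea (not the reference implementation): the network separates a mixture of two reference mixtures into several estimated sources, and the loss scores the best assignment of those sources back to the two references:

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Mixture invariant training (sketch): choose the assignment of
    estimated sources to the two reference mixtures that minimizes
    the summed reconstruction MSE."""
    n = len(est_sources)
    best = np.inf
    for assign in itertools.product((0, 1), repeat=n):
        m1 = sum((s for s, a in zip(est_sources, assign) if a == 0),
                 np.zeros_like(mix1))
        m2 = sum((s for s, a in zip(est_sources, assign) if a == 1),
                 np.zeros_like(mix2))
        err = np.mean((m1 - mix1) ** 2) + np.mean((m2 - mix2) ** 2)
        best = min(best, err)
    return best

rng = np.random.default_rng(0)
sources = rng.normal(size=(4, 8))                    # toy "estimated" sources
loss = mixit_loss(sources, sources[0] + sources[1],  # references built from
                  sources[2] + sources[3])           # the same sources
```

Because only mixtures (never isolated sources) are needed as references, training requires no source-level supervision; the exhaustive assignment search above is exponential in the number of sources, so real implementations keep that number small.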
Conditioned Source Separation for Musical Instrument Performances
This paper proposes a source separation method for multiple musical instruments sounding simultaneously and explores how much additional information apart from the audio stream can lift the quality of source separation.