One-Shot Conditional Audio Filtering of Arbitrary Sounds

Beat Gfeller, Dominik Roblek, Marco Tagliasacchi
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a wave-to-wave neural network architecture, we can train a model without using any sound class labels. Using a conditioning encoder model which is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen… 


Text-Driven Separation of Arbitrary Sounds

This work proposes a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source, and shows that SoundWords is effective at learning co-embeddings and that the multi-modal training approach improves the performance of SoundFilter.

Few-Shot Musical Source Separation

It is shown that the proposed few-shot conditioning paradigm outperforms the baseline one-hot instrument-class conditioned model for both seen and unseen instruments.

SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

This paper introduces a target sound extraction (TSE) framework, SoundBeam, that combines the advantages of both sound-class-label-based and enrollment-based approaches, and performs an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.

AvaTr: One-Shot Speaker Extraction with Transformers

Two models that incorporate voice characteristics into a Transformer, based on different insights about where the feature selection should take place, yield excellent performance, on par with or better than published state-of-the-art models on the speaker extraction task, including separating speech of novel speakers not seen during training.

Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

A streaming Transformer-based PSE model is presented, together with a novel cross-attention approach that gives adaptive target speaker representations; the model consistently outperforms competitive baselines, even when it is only approximately half the size.

Real-Time Target Sound Extraction

The first neural network model to achieve real-time and streaming target sound extraction is presented; it uses an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder.

CoSSD - An end-to-end framework for multi-instance source separation and detection

A key feature of the proposed CoSSD is that it performs detection in addition to separation, making it a practical and unified solution for query-based audio analysis.

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by…

Improving Target Sound Extraction with Timestamp Information

A mutual learning framework for target sound detection and extraction is proposed, and experimental results on synthesized data generated from the Freesound Datasets show that the proposed method can significantly improve the performance of TSE.

Do You See What I See? Capabilities and Limits of Automated Multimedia Content Analysis

The capabilities and limitations of tools for analyzing online multimedia content are explained and the potential risks of using these tools at scale without accounting for their limitations are highlighted.



FSD50K: An Open Dataset of Human-Labeled Sound Events

FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 hours of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.

Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

Librispeech: An ASR corpus based on public domain audio books

It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations

Temporal Feature-Wise Linear Modulation (TFiLM) is proposed, a novel architectural component inspired by adaptive batch normalization and its extensions that uses a recurrent neural network to alter the activations of a convolutional model.

FiLM: Visual Reasoning with a General Conditioning Layer

It is shown that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning.
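
As a rough illustration of what a feature-wise linear modulation does, the sketch below scales and shifts each feature channel with its own pair of coefficients. The function name `film` and the toy values are hypothetical, and the coefficients are hard-coded here, whereas in a FiLM layer they would be predicted by a conditioning network. This is a minimal sketch of the general operation, not the implementation from any of the listed papers.

```python
from typing import List


def film(features: List[List[float]],
         gamma: List[float],
         beta: List[float]) -> List[List[float]]:
    """Feature-wise linear modulation.

    features is indexed as [channel][time]; each channel c is mapped to
    gamma[c] * features[c][t] + beta[c] for every time step t.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need exactly one (gamma, beta) pair per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Toy input: 2 channels, 3 time steps (values chosen for illustration only).
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
gamma = [2.0, 0.0]  # assumption: would come from a conditioning network
beta = [0.5, 1.0]   # assumption: would come from a conditioning network

modulated = film(features, gamma, beta)
print(modulated)
```

Setting a channel's scale to zero, as for the second channel here, replaces that channel with its shift value, which hints at how a conditioning signal can switch features on or off.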

SEANet: A Multi-modal Speech Enhancement Network

This work explores the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions by feeding a multi-modal input to SEANet, a wave-to-wave fully convolutional model, which adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the user's speech.

Learning to Denoise Historical Music

An audio-to-audio neural network model is proposed that learns to denoise old music recordings; it converts its input into a time-frequency representation by means of a short-time Fourier transform and processes the resulting complex spectrogram using a convolutional neural network.

A Spectral Energy Distance for Parallel Speech Synthesis

This work proposes a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function, based on a generalized energy distance between the distributions of the generated and real audio.

Unsupervised Sound Separation Using Mixtures of Mixtures

This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
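
The idea behind mixture invariant training can be sketched as follows: two reference mixtures are summed, a model estimates several sources from the sum, and the loss is taken for the best assignment of each estimated source to one of the two reference mixtures. The names `mixit_loss` and `mse` are hypothetical, and mean squared error is used as a stand-in for the signal-level loss, so this is a simplified sketch under those assumptions rather than the paper's exact objective.

```python
from itertools import product
from typing import List, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixit_loss(mix1: Sequence[float],
               mix2: Sequence[float],
               est_sources: List[Sequence[float]]
               ) -> Tuple[float, Tuple[int, ...]]:
    """Search all assignments of estimated sources to the two mixtures.

    Each estimated source goes to mixture 0 or mixture 1. The sources
    assigned to a mixture are summed and compared with that mixture.
    Returns the lowest total loss and the assignment that achieves it.
    """
    n = len(mix1)
    best_loss = float("inf")
    best_assignment: Tuple[int, ...] = ()
    for assignment in product((0, 1), repeat=len(est_sources)):
        remix = [[0.0] * n, [0.0] * n]
        for source, target in zip(est_sources, assignment):
            for t, value in enumerate(source):
                remix[target][t] += value
        loss = mse(mix1, remix[0]) + mse(mix2, remix[1])
        if loss < best_loss:
            best_loss, best_assignment = loss, assignment
    return best_loss, best_assignment


# Toy signals (illustrative only): the second mixture contains two sources.
mix1 = [1.0, 0.0, 1.0, 0.0]
mix2 = [0.0, 2.0, 0.0, 2.0]
est_sources = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
]

loss, assignment = mixit_loss(mix1, mix2, est_sources)
print(loss, assignment)
```

The exhaustive search over assignments grows as 2^M for M estimated sources, which is acceptable for the small M used in this kind of sketch.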

Conditioned Source Separation for Musical Instrument Performances

This paper proposes a source separation method for multiple musical instruments sounding simultaneously and explores how much additional information apart from the audio stream can lift the quality of source separation.