One-Shot Conditional Audio Filtering of Arbitrary Sounds

@article{Gfeller2021OneShotCA,
  title={One-Shot Conditional Audio Filtering of Arbitrary Sounds},
  author={Beat Gfeller and Dominik Roblek and Marco Tagliasacchi},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={501-505}
}
We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a wave-to-wave neural network architecture, we can train a model for this task without using any sound class labels. With a conditioning encoder that is learned jointly with the source separation network, the trained model can be "configured" to filter arbitrary sound sources, even ones that it has not seen… 
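To make the conditioning mechanism described in the abstract concrete, the following is a minimal PyTorch sketch: a conditioning encoder maps the short target sample to an embedding, and that embedding modulates a wave-to-wave filter network via FiLM-style per-channel scaling and shifting. All module names, layer sizes, and the plain convolutional stacks below are assumptions chosen for illustration, not the actual SoundFilter architecture.

# Illustrative sketch only; not the authors' exact model.
import torch
import torch.nn as nn


class ConditioningEncoder(nn.Module):
    """Maps a short sample of the target source to an embedding vector."""

    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=16, stride=8), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, stride=8), nn.ReLU(),
        )
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, ref_wave):                      # (batch, 1, samples)
        h = self.conv(ref_wave)                       # (batch, 128, frames)
        return self.proj(h.mean(dim=-1))              # (batch, emb_dim)


class ConditionedFilter(nn.Module):
    """Wave-to-wave separator whose features are modulated by the embedding."""

    def __init__(self, channels=64, emb_dim=128):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=15, padding=7)
        self.film = nn.Linear(emb_dim, 2 * channels)  # per-channel scale and shift
        self.body = nn.Conv1d(channels, channels, kernel_size=15, padding=7)
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, mixture, embedding):            # mixture: (batch, 1, samples)
        h = torch.relu(self.inp(mixture))
        gamma, beta = self.film(embedding).chunk(2, dim=-1)
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)
        h = torch.relu(self.body(h))
        return self.out(h)                            # estimated target waveform


# Both networks are trained jointly, so no sound-class labels are needed.
encoder, separator = ConditioningEncoder(), ConditionedFilter()
ref = torch.randn(2, 1, 16000)      # 1 s conditioning sample at 16 kHz
mix = torch.randn(2, 1, 48000)      # 3 s mixture
target_hat = separator(mix, encoder(ref))

Because the separator only sees the target through the embedding, pointing it at a different source is just a matter of swapping the conditioning sample.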

Citations

Few-Shot Musical Source Separation
TLDR
It is shown that the proposed few-shot conditioning paradigm outperforms the baseline one-hot instrument-class conditioned model for both seen and unseen instruments.
SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning
TLDR
This paper introduces a TSE framework, SoundBeam, that combines the advantages of both sound-class-label and enrollment-based approaches, and performs an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.
AvaTr: One-Shot Speaker Extraction with Transformers
TLDR
Two models that incorporate voice characteristics into a Transformer, based on different insights about where feature selection should take place, yield excellent performance, on par with or better than published state-of-the-art models on the speaker extraction task, including separating the speech of novel speakers not seen during training.
Text-Driven Separation of Arbitrary Sounds
TLDR
This work proposes a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source, by combining two distinct models that are agnostic to the conditioning modality.
Do You See What I See? Capabilities and Limits of Automated Multimedia Content Analysis
TLDR
The capabilities and limitations of tools for analyzing online multimedia content are explained and the potential risks of using these tools at scale without accounting for their limitations are highlighted.
Improving Target Sound Extraction with Timestamp Information
TLDR
A mutual learning framework of target sound detection and extraction is proposed, and experimental results on synthesized data generated from the Freesound Datasets show that the proposed method can significantly improve TSE performance.
Few-shot learning of new sound classes for target sound extraction
TLDR
This work proposes combining 1-hot- and enrollment-based target sound extraction, allowing optimal performance for seen AE classes and simple extension to new classes, and proposes adapting the embedding vectors obtained from a few enrollment audio samples to further improve performance on new classes.
Multistage linguistic conditioning of convolutional layers for speech emotion recognition
TLDR
This work proposes a novel, multistage fusion method in which the two information streams are integrated in several layers of a deep neural network (DNN), and contrasts it with a single-stage approach where the streams are merged at a single point.

References

Showing 1-10 of 29 references
FSD50K: An Open Dataset of Human-Labeled Sound Events
TLDR
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 hours of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Librispeech: An ASR corpus based on public domain audio books
TLDR
It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations
TLDR
Temporal Feature-Wise Linear Modulation (TFiLM) is proposed, a novel architectural component inspired by adaptive batch normalization and its extensions that uses a recurrent neural network to alter the activations of a convolutional model.
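A rough sketch of the TFiLM idea follows (the block size, the choice of max pooling, and the LSTM width are assumptions, not the paper's exact configuration): convolutional activations are split into temporal blocks, each block is pooled to a summary, an RNN runs over the block summaries, and its outputs act as a per-block, per-channel scale and shift.

# Illustrative sketch of temporal feature-wise modulation; assumed shapes and sizes.
import torch
import torch.nn as nn


class TemporalFiLM(nn.Module):
    def __init__(self, channels, block_size):
        super().__init__()
        self.block_size = block_size
        self.rnn = nn.LSTM(channels, 2 * channels, batch_first=True)

    def forward(self, x):                              # x: (batch, channels, time)
        b, c, t = x.shape
        blocks = x.view(b, c, t // self.block_size, self.block_size)
        pooled = blocks.max(dim=-1).values             # (batch, channels, n_blocks)
        params, _ = self.rnn(pooled.transpose(1, 2))   # (batch, n_blocks, 2*channels)
        gamma, beta = params.transpose(1, 2).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1)                    # broadcast over each block
        beta = beta.unsqueeze(-1)
        return (blocks * gamma + beta).view(b, c, t)


x = torch.randn(4, 32, 1024)                           # conv activations
y = TemporalFiLM(channels=32, block_size=128)(x)       # same shape, now modulated
print(y.shape)                                         # torch.Size([4, 32, 1024])

Because the modulation parameters come from an RNN over block summaries, each block's scaling can depend on activations far away in time, which is what lets the component capture long-range dependencies.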
FiLM: Visual Reasoning with a General Conditioning Layer
TLDR
It is shown that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning.
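FiLM itself reduces to a learned feature-wise affine transform whose scale and shift are predicted from a conditioning input; a minimal sketch follows (dimensions and names are assumed for illustration).

# Illustrative FiLM layer: per-channel gamma/beta predicted from a conditioning vector.
import torch
import torch.nn as nn


class FiLM(nn.Module):
    def __init__(self, cond_dim, channels):
        super().__init__()
        self.to_params = nn.Linear(cond_dim, 2 * channels)

    def forward(self, features, cond):                 # features: (batch, channels, ...)
        gamma, beta = self.to_params(cond).chunk(2, dim=-1)
        while gamma.dim() < features.dim():            # broadcast over spatial/temporal dims
            gamma, beta = gamma.unsqueeze(-1), beta.unsqueeze(-1)
        return gamma * features + beta


feats = torch.randn(2, 16, 8, 8)                       # e.g. CNN feature maps
cond = torch.randn(2, 10)                              # e.g. a question embedding
print(FiLM(cond_dim=10, channels=16)(feats, cond).shape)   # torch.Size([2, 16, 8, 8])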
Conditioned Source Separation for Musical Instrument Performances
TLDR
This paper proposes a source separation method for multiple musical instruments sounding simultaneously and explores how much additional information, apart from the audio stream, can improve the quality of source separation.
Wavesplit: End-to-End Speech Separation by Speaker Clustering
TLDR
Wavesplit redefines the state-of-the-art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.
A Spectral Energy Distance for Parallel Speech Synthesis
TLDR
This work proposes a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function, based on a generalized energy distance between the distributions of the generated and real audio.
DDSP: Differentiable Digital Signal Processing
TLDR
The Differentiable Digital Signal Processing library is introduced, which enables direct integration of classic signal processing elements with deep learning methods and achieves high-fidelity generation without the need for large autoregressive models or adversarial losses.
Finding Strength in Weakness: Learning to Separate Sounds With Weak Supervision
TLDR
This work proposes objective functions and network architectures that enable training a source separation system with weak labels and benchmarks the performance of the algorithm using synthetic mixtures of overlapping events created from a database of sounds recorded in urban environments.