Few-shot learning of new sound classes for target sound extraction

@inproceedings{Delcroix2021FewshotLO,
  title={Few-shot learning of new sound classes for target sound extraction},
  author={Marc Delcroix and Jorge Bennasar V{\'a}zquez and Tsubasa Ochiai and Keisuke Kinoshita and Shoko Araki},
  booktitle={Interspeech},
  year={2021}
}
Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds. It can be realized using a neural network that extracts the target sound conditioned on a 1-hot vector that represents the desired AE class. With this approach, embedding vectors associated with the AE classes are directly optimized for the extraction of sound classes seen during training. However, it is not easy to extend this framework to new AE classes, i.e. unseen… 
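The conditioning scheme described in the abstract is easy to sketch. Below is a minimal, hypothetical PyTorch sketch of a 1-hot-conditioned extractor: an embedding learned per AE class modulates the mixture features that drive mask estimation. Module names and dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class OneHotConditionedExtractor(nn.Module):
    """Sketch of 1-hot-conditioned target sound extraction: one embedding
    per AE class is learned jointly with the extraction network and used
    to modulate the mixture features."""
    def __init__(self, num_classes: int, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=16, stride=8)
        self.class_emb = nn.Embedding(num_classes, feat_dim)
        self.separator = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.Sigmoid(),                            # mask in [0, 1]
        )
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor, class_id: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, 1, samples); class_id: (batch,) integer AE labels
        feats = self.encoder(mixture)                # (B, D, T)
        emb = self.class_emb(class_id).unsqueeze(-1) # (B, D, 1)
        mask = self.separator(feats * emb)           # condition via element-wise product
        return self.decoder(feats * mask)            # estimated target signal
```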

Citations

SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

This paper introduces a TSE framework, SoundBeam, that combines the advantages of both sound-class-label-based and enrollment-based TSE approaches, and performs an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.
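A rough sketch of the shared-clue idea, under the assumption that label and enrollment clues are mapped into one embedding space and combined by simple averaging (all names hypothetical, not SoundBeam's exact design):

```python
import torch
import torch.nn as nn

class ClueEncoder(nn.Module):
    """Sketch: map a 1-hot class label and/or an enrollment recording into
    one shared embedding space, so either clue can drive the extractor."""
    def __init__(self, num_classes: int, feat_dim: int = 256):
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, feat_dim)
        self.enroll_net = nn.Sequential(          # hypothetical enrollment encoder
            nn.Conv1d(1, feat_dim, kernel_size=16, stride=8),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # average over time -> one vector
        )

    def forward(self, class_id=None, enrollment=None):
        clues = []
        if class_id is not None:                  # class_id: (B,) integer labels
            clues.append(self.label_emb(class_id))
        if enrollment is not None:                # enrollment: (B, 1, samples)
            clues.append(self.enroll_net(enrollment).squeeze(-1))
        assert clues, "provide a class label, an enrollment signal, or both"
        return torch.stack(clues, dim=0).mean(dim=0)  # (B, feat_dim)
```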

Improving Target Sound Extraction with Timestamp Information

A mutual learning framework of target sound detection and extraction is proposed, and experimental results on synthesized data generated from the Freesound Datasets show that the proposed method can significantly improve TSE performance.

RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection

A reference-aware and duration-robust network (RaDur) for TSD is presented, together with an embedding enhancement module that takes the mixture audio into account while generating the embedding; attention pooling is applied to enhance the features of target-sound-related frames and weaken the features of noisy frames.
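Attention pooling here follows the standard recipe: learn a relevance score per frame, softmax over time, and take the weighted sum. A self-contained sketch (dimensions are assumptions):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch: pool a frame sequence into one vector, weighting frames by a
    learned relevance score so target-sound frames dominate the result."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (B, T, 1)
        return (weights * frames).sum(dim=1)                # (B, feat_dim)
```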

Real-Time Target Sound Extraction

The first neural network model to achieve real-time and streaming target sound extraction is presented, with an encoder-decoder architecture that uses a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder.
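The streaming property comes from causal dilated convolutions: left-only padding means each output frame sees only past input, while the dilation schedule grows the receptive field exponentially. A minimal sketch (layer count and widths are assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedEncoder(nn.Module):
    """Sketch: stack of dilated causal 1-D convolutions. Left-only padding
    means each output frame depends only on past samples, so the encoder
    can run frame-by-frame in a streaming setting."""
    def __init__(self, channels: int = 256, num_layers: int = 6, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** i)
            for i in range(num_layers)            # receptive field doubles per layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        for i, conv in enumerate(self.layers):
            pad = (self.kernel_size - 1) * (2 ** i)
            y = conv(F.pad(x, (pad, 0)))          # pad on the left only (causal)
            x = x + torch.relu(y)                 # residual connection
        return x
```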

References

Showing 1-10 of 34 references

Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis

This work proposes a source separation framework trained with weakly labelled data that can separate 527 sound classes from AudioSet within a single system.

Listen to What You Want: Neural Network-based Universal Sound Selector

A universal sound selection neural network is proposed that can directly select AE sounds from a mixture given user-specified target AE classes, independently of the number of sources in the mixture.
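Conditioning on user-specified classes can be sketched as turning a multi-hot selection vector into a single conditioning vector, e.g. a sum of the selected class embeddings (a simple assumption, not necessarily the paper's exact scheme):

```python
import torch
import torch.nn as nn

def selection_embedding(emb_table: nn.Embedding, multi_hot: torch.Tensor) -> torch.Tensor:
    """Sketch: turn a multi-hot class-selection vector into one conditioning
    vector by summing the embeddings of every selected class.

    multi_hot: (batch, num_classes) float tensor with 1.0 for each class
    the user wants to extract. Returns (batch, feat_dim).
    """
    return multi_hot @ emb_table.weight  # (B, C) x (C, D) -> (B, D)
```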

Learning to Separate Sounds from Weakly Labeled Scenes

This work proposes objective functions and network architectures that enable training a source separation system with weak labels, and benchmarks performance using synthetic mixtures of overlapping sound events recorded in urban environments.
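One way to instantiate such a weak-label objective, in the spirit of this work, is to score each separated output with a clip-level sound-event classifier instead of comparing against isolated references, which weak labels do not provide. The sketch below is a hypothetical instantiation; the classifier and its output convention are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weak_label_loss(separated: torch.Tensor, clip_labels: torch.Tensor,
                    classifier: nn.Module) -> torch.Tensor:
    """Hypothetical weak-label separation objective.

    separated:   (batch, num_classes, samples), one output per class.
    clip_labels: (batch, num_classes) multi-hot clip-level labels.
    classifier:  assumed to map (N, samples) audio to (N, num_classes)
                 class probabilities in [0, 1] (e.g. sigmoid outputs).
    """
    B, C, _ = separated.shape
    preds = classifier(separated.reshape(B * C, -1)).reshape(B, C, C)
    # Target: output k should contain class k iff the clip has class k,
    # and none of the other classes (zero off-diagonal targets).
    targets = torch.diag_embed(clip_labels.float())  # (B, C, C)
    return F.binary_cross_entropy(preds, targets)
```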

What’s all the Fuss about Free Universal Sound Separation Data?

Scott Wisdom, Hakan Erdogan, J. Hershey. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced; based on an improved time-domain convolutional network (TDCN++), it achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.
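SI-SNRi, the metric cited here and throughout this literature, has a standard definition: project the estimate onto the reference to remove scale, compute the ratio of projected energy to residual energy in dB, and report the gain over using the unprocessed mixture. A self-contained PyTorch implementation:

```python
import torch

def si_snr(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant signal-to-noise ratio in dB (standard definition).
    estimate, reference: (batch, samples); means are removed first."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference; rescaling the estimate
    # therefore cannot change the score.
    dot = (estimate * reference).sum(dim=-1, keepdim=True)
    s_target = dot * reference / (reference.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    return 10 * torch.log10(
        s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain over simply outputting the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```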

Improving Speaker Discrimination of Target Speech Extraction With Time-Domain Speakerbeam

Strategies for improving the speaker discrimination capability of SpeakerBeam are investigated, and it is shown experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures, and outperform TasNet in terms of target speech extraction.

Deep Extractor Network for Target Speaker Recovery From Single Channel Speech Mixtures

A novel "deep extractor network" which creates an extractor point for the target speaker in a canonical high dimensional embedding space, and pulls together the time-frequency bins corresponding to thetarget speaker.

SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

This paper introduces SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker and shows the benefit of including speaker information in the processing and the effectiveness of the proposed method.

Audio Query-based Music Source Separation

A network for audio query-based music source separation is proposed that can explicitly encode the source information from a query signal, regardless of the number and/or kind of target signals.

Universal Sound Separation

A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.

Scaper: A library for soundscape synthesis and augmentation

Given a collection of isolated sound events, Scaper acts as a high-level sequencer that can generate multiple soundscapes from a single, probabilistically defined "specification", to increase the variability of the output.
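Scaper is a real, open-source Python library; below is a minimal usage sketch following its documented API as I recall it (paths and distribution parameters are placeholders, and exact signatures should be checked against the Scaper docs):

```python
import scaper

# Folders of isolated foreground/background recordings, one subfolder per
# label (Scaper's expected layout); paths here are placeholders.
sc = scaper.Scaper(duration=10.0, fg_path='foreground/', bg_path='background/')
sc.ref_db = -20  # loudness reference for the generated soundscape

# Each tuple is a distribution sampled at generation time, so one
# specification can yield many different soundscapes.
sc.add_background(label=('choose', []),            # any available label
                  source_file=('choose', []),
                  source_time=('const', 0))
sc.add_event(label=('choose', []),
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 8),         # random onset
             event_duration=('truncnorm', 2, 1, 0.5, 4),
             snr=('normal', 10, 3),                # random SNR vs. background
             pitch_shift=None,
             time_stretch=None)

sc.generate('mixture.wav', 'mixture.jams')         # audio + JAMS annotation
```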