Improving Target Sound Extraction with Timestamp Information

@inproceedings{Wang2022ImprovingTS,
  title={Improving Target Sound Extraction with Timestamp Information},
  author={Helin Wang and Dongchao Yang and Chao Weng and Jianwei Yu and Yuexian Zou},
  booktitle={Interspeech},
  year={2022}
}
Target sound extraction (TSE) aims to extract the part corresponding to a target sound event class from an audio mixture containing multiple sound events. Previous works mainly focus on the problems of weakly labelled data, joint learning, and new classes; however, none of them considers the onset and offset times of the target sound event, which have been emphasized in auditory scene analysis. In this paper, we study how to utilize such timestamp information to help extract the target sound via a target…
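
The abstract describes conditioning a target sound extraction network on both a target-class identity and the event's onset/offset (timestamp) information. Below is a minimal PyTorch sketch of that conditioning idea; it is not the authors' model, and the module name, feature sizes, and the multiplicative conditioning scheme are assumptions made for illustration.

import torch
import torch.nn as nn

class TimestampConditionedTSE(nn.Module):
    def __init__(self, num_classes: int, n_freq: int = 257, hidden: int = 256):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, hidden)      # target-class conditioning vector
        self.encoder = nn.GRU(n_freq, hidden, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid(),             # time-frequency mask in [0, 1]
        )

    def forward(self, mix_spec, target_class, timestamp):
        # mix_spec: (B, T, F) magnitude spectrogram of the mixture
        # target_class: (B,) integer index of the target sound event class
        # timestamp: (B, T) frame-level activity of the target event (soft or hard)
        h, _ = self.encoder(mix_spec)                            # (B, T, 2*hidden)
        cond = self.class_emb(target_class).unsqueeze(1)         # (B, 1, hidden)
        h = h * torch.cat([cond, cond], dim=-1)                  # multiplicative class conditioning
        h = h * timestamp.unsqueeze(-1)                          # emphasize frames where the event is active
        mask = self.mask_head(h)                                 # (B, T, F)
        return mask * mix_spec                                   # estimated target spectrogram

model = TimestampConditionedTSE(num_classes=10)
mixture = torch.rand(2, 100, 257)
estimate = model(mixture, torch.tensor([3, 7]), torch.rand(2, 100))
print(estimate.shape)                                            # torch.Size([2, 100, 257])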

Detect What You Want: Target Sound Detection

A novel target sound detection network (TSDNet) is presented which consists of two main parts: a conditional network, which aims at generating a sound-discriminative conditional embedding vector representing the target sound, and a detection network, which takes both the mixture audio and the conditional embedding vector as inputs and produces the detection result of the target sound.
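
To make the two-branch structure concrete, here is a rough PyTorch sketch of a conditional network paired with a frame-level detection network; the layer choices and dimensions are placeholders, not the published TSDNet architecture.

import torch
import torch.nn as nn

class ConditionalNet(nn.Module):
    """Maps a reference clip of the target sound to a conditional embedding."""
    def __init__(self, n_freq: int = 64, emb_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_freq, emb_dim, batch_first=True)

    def forward(self, ref_spec):                                 # (B, T_ref, F)
        _, h = self.rnn(ref_spec)
        return h[-1]                                             # (B, emb_dim)

class DetectionNet(nn.Module):
    """Takes the mixture and the conditional embedding, outputs frame-level detection."""
    def __init__(self, n_freq: int = 64, emb_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_freq + emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, mix_spec, cond):                           # (B, T, F), (B, emb_dim)
        cond = cond.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_spec, cond], dim=-1))
        return torch.sigmoid(self.out(h)).squeeze(-1)            # (B, T) per-frame probability

cond_net, det_net = ConditionalNet(), DetectionNet()
prob = det_net(torch.rand(2, 200, 64), cond_net(torch.rand(2, 50, 64)))
print(prob.shape)                                                # torch.Size([2, 200])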

Few-shot learning of new sound classes for target sound extraction

This work proposes combining 1-hot and enrollment-based target sound extraction, allowing optimal performance for seen AE classes and simple extension to new classes, and proposes adapting the embedding vectors obtained from a few enrollment audio samples to further improve performance on new classes.
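
The combination of 1-hot and enrollment-based conditioning can be pictured as below; this is a hypothetical helper written only to illustrate the idea, not the paper's system.

import torch
import torch.nn as nn

class CombinedConditioner(nn.Module):
    def __init__(self, num_seen_classes: int, emb_dim: int = 128, n_freq: int = 64):
        super().__init__()
        self.onehot_emb = nn.Embedding(num_seen_classes, emb_dim)   # seen classes: learned 1-hot embedding
        self.enroll_enc = nn.GRU(n_freq, emb_dim, batch_first=True) # new classes: embed enrollment audio

    def embed_enrollment(self, enroll_specs):
        # enroll_specs: (K, T, F) a few enrollment clips of a (possibly new) class
        _, h = self.enroll_enc(enroll_specs)
        return h[-1].mean(dim=0)                                     # (emb_dim,) averaged over the K clips

    def forward(self, class_id=None, enroll_specs=None):
        if class_id is not None:                                     # seen class: use the 1-hot embedding
            return self.onehot_emb(class_id)
        return self.embed_enrollment(enroll_specs)                   # new class: enrollment embedding

cond = CombinedConditioner(num_seen_classes=20)
seen_emb = cond(class_id=torch.tensor([4]))                          # (1, 128)
new_emb = cond(enroll_specs=torch.rand(3, 80, 64))                   # (128,)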

Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis

This work proposes a source separation framework trained with weakly labelled data that can separate 527 kinds of sound classes from AudioSet within a single system.

Environmental Sound Classification with Parallel Temporal-Spectral Attention

A novel parallel temporal-spectral attention mechanism for CNN to learn discriminative sound representations is proposed, which enhances the temporal and spectral features by capturing the importance of different time frames and frequency bands.
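
The parallel temporal-spectral attention mechanism can be sketched as two attention branches applied to a spectrogram feature map; the layers below are assumption-level placeholders rather than the paper's exact design.

import torch
import torch.nn as nn

class TemporalSpectralAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.time_att = nn.Conv1d(channels, 1, kernel_size=1)   # attention weights over time frames
        self.freq_att = nn.Conv1d(channels, 1, kernel_size=1)   # attention weights over frequency bands

    def forward(self, x):                                        # x: (B, C, T, F)
        t = x.mean(dim=3)                                        # (B, C, T) pool over frequency
        f = x.mean(dim=2)                                        # (B, C, F) pool over time
        w_t = torch.softmax(self.time_att(t), dim=-1)            # (B, 1, T)
        w_f = torch.softmax(self.freq_att(f), dim=-1)            # (B, 1, F)
        # re-weight the feature map along both axes in parallel and merge
        return x * w_t.unsqueeze(-1) + x * w_f.unsqueeze(2)

att = TemporalSpectralAttention(channels=32)
print(att(torch.rand(2, 32, 100, 64)).shape)                     # torch.Size([2, 32, 100, 64])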

One-Shot Conditional Audio Filtering of Arbitrary Sounds

We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a…

Audio Query-based Music Source Separation

A network for audio query-based music source separation that can explicitly encode the source information from a query signal regardless of the number and/or kind of target signals is proposed.

What Affects the Performance of Convolutional Neural Networks for Audio Event Classification

This paper designs convolutional neural networks for audio event classification (called FPNet); on the environmental sound dataset ESC-50, the classification accuracies of FPNet-1D and FPNet-2D reach 73.90% and 85.10% respectively, improving significantly over previous methods.

Learning to Separate Sounds from Weakly Labeled Scenes

This work proposes objective functions and network architectures that enable training a source separation system with weak labels, and benchmarks performance using synthetic mixtures of overlapping sound events recorded in urban environments.
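
The weak-label training idea, that separated sources should be consistent with clip-level tags, can be sketched as a loss function; the separator and classifier below are toy placeholder modules, and the per-class output arrangement is an assumption for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

C, FREQ = 4, 64                                                  # toy number of classes / frequency bins

separator = nn.Linear(FREQ, C * FREQ)                            # predicts C source spectrograms per frame
classifier = nn.Sequential(nn.Flatten(1), nn.LazyLinear(C))      # clip-level tagger applied to a whole source

def weak_label_loss(mix_spec, weak_labels):
    # mix_spec: (B, T, FREQ) mixture; weak_labels: (B, C) multi-hot clip-level tags
    B, T, _ = mix_spec.shape
    sources = separator(mix_spec).view(B, T, C, FREQ).permute(0, 2, 1, 3)   # (B, C, T, FREQ)
    logits = classifier(sources.reshape(B * C, T, FREQ)).view(B, C, C)      # clip tags for each source
    # the source estimated for class c should be tagged as class c only when c is in the mixture
    target = torch.diag_embed(weak_labels.float())                          # (B, C, C)
    return F.binary_cross_entropy_with_logits(logits, target)

loss = weak_label_loss(torch.rand(2, 50, FREQ), torch.randint(0, 2, (2, C)))
print(loss.item())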

Recurrent neural networks for polyphonic sound event detection in real life recordings

In this paper we present an approach to polyphonic sound event detection in real-life recordings based on bi-directional long short-term memory (BLSTM) recurrent neural networks (RNNs). A single…
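
A compact frame-wise BLSTM detector with multi-label sigmoid outputs illustrates this kind of model; the hyperparameters below are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class BLSTMSED(nn.Module):
    def __init__(self, n_mels: int = 40, n_classes: int = 6, hidden: int = 128):
        super().__init__()
        self.blstm = nn.LSTM(n_mels, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, log_mel):                                  # (B, T, n_mels)
        h, _ = self.blstm(log_mel)
        return torch.sigmoid(self.out(h))                        # (B, T, n_classes): per-frame activity

sed = BLSTMSED()
probs = sed(torch.rand(2, 500, 40))
events = probs > 0.5                                             # threshold to binary event activity
print(probs.shape)                                               # torch.Size([2, 500, 6])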

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.