Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures

Sangwook Park, David K. Han, Mounya Elhilali
Sound event detection is an important facet of audio tagging that aims to identify sounds of interest and define both the sound category and the time boundaries of each sound event in a continuous recording. With advances in deep neural networks, the performance of sound event detection systems has improved tremendously, although at the expense of costly data collection and labeling efforts. In fact, current state-of-the-art methods employ supervised training methods that leverage…


Sound Event Detection with Cross-Referencing Self-Training

This approach takes advantage of semi-supervised training using pseudo-labels from a balanced student-teacher model, and outperforms the DCASE2021 challenge baseline in terms of the Polyphonic Sound Detection Score.
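The student-teacher pseudo-labeling idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the EMA decay value and the 0.5 binarization threshold are assumptions, and real systems apply these per-frame over network weight tensors rather than scalars.

```python
import numpy as np

def ema_update(teacher_w, student_w, decay=0.999):
    # The teacher's parameters track an exponential moving average
    # of the student's, giving a smoothed, more stable predictor.
    return decay * teacher_w + (1 - decay) * student_w

def pseudo_label(teacher_probs, threshold=0.5):
    # Binarize the teacher's posteriors into pseudo labels that
    # supervise the student on unlabeled clips.
    return (np.asarray(teacher_probs) >= threshold).astype(float)
```

The student is trained on labeled data plus teacher-generated pseudo labels, while the teacher is refreshed by `ema_update` after each step, so the two models reinforce each other.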



Self-Training for Sound Event Detection in Audio Mixtures

A self-training technique is proposed to leverage unlabeled datasets in supervised learning, using pseudo-label estimation and a dual-term objective function: a classification loss for the original labels and an expectation loss for the pseudo labels.
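The dual-term objective can be sketched as follows. This is a hedged illustration, not the cited paper's exact formulation: it uses binary cross-entropy as a stand-in for both terms, and the `alpha` weight balancing the two losses is an assumption.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    # Element-wise BCE averaged over all (clip, class) entries.
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    target = np.asarray(target, dtype=float)
    return float(np.mean(-(target * np.log(pred)
                           + (1 - target) * np.log(1 - pred))))

def dual_term_loss(pred_labeled, labels,
                   pred_unlabeled, pseudo_labels, alpha=0.5):
    # Classification loss on clips with original (human) labels ...
    cls_loss = binary_cross_entropy(pred_labeled, labels)
    # ... plus an expectation-style loss pulling predictions on
    # unlabeled clips toward their estimated pseudo labels.
    exp_loss = binary_cross_entropy(pred_unlabeled, pseudo_labels)
    return cls_loss + alpha * exp_loss
```

When both prediction sets match their targets, the loss approaches zero; mismatched pseudo labels raise only the second term, which `alpha` downweights to reflect their lower reliability.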

Joint Acoustic and Class Inference for Weakly Supervised Sound Event Detection

This work presents a hybrid approach that combines acoustic-driven event boundary detection with supervised label inference using a deep neural network. It leverages the benefits of both unsupervised and supervised methodologies and takes advantage of large amounts of unlabeled data, making it well suited to large-scale weakly labeled event detection.

Semi-supervised Acoustic Event Detection Based on Tri-training

This paper uses an Internet-scale unlabeled dataset with potential domain shift to improve the detection of acoustic events, and shows accuracy improvements over both the supervised training baseline and a semi-supervised self-training setup on all predefined acoustic event detection tasks.

Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization

A convolutional neural network transformer (CNN-Transformer) is proposed for audio tagging and SED, and it is shown that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN).

Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis

The paper introduces the Domestic Environment Sound Event Detection (DESED) dataset, which combines part of the previous year's dataset with an additional synthetic, strongly labeled dataset provided this year and described in detail.


This model employs conformer blocks, which combine the self-attention and depth-wise convolution networks, to efficiently capture the global and local context information of an audio feature sequence.

Guided Learning for Weakly-Labeled Semi-Supervised Sound Event Detection

An end-to-end semi-supervised learning process is proposed in which two coupled models improve alternately, and this approach is shown to achieve competitive performance on the DCASE2018 Task 4 dataset.

Sound Event Detection by Consistency Training and Pseudo-Labeling With Feature-Pyramid Convolutional Recurrent Neural Networks

This work proposes FP-CRNN, a convolutional recurrent neural network (CRNN) with feature-pyramid (FP) components that leverages temporal information by utilizing features at different scales, to exploit large amounts of unlabeled in-domain data efficiently.

Training Sound Event Detection on a Heterogeneous Dataset

This work performs a detailed analysis of the DCASE 2020 Task 4 sound event detection baseline with regard to several aspects, such as the type of data used for training, the parameters of the mean teacher, and the transformations applied while generating the synthetic soundscapes.

Polyphonic Sound Event Detection Based on Residual Convolutional Recurrent Neural Network With Semi-Supervised Loss Function

A two-stage polyphonic SED model is proposed for the setting where strongly labeled data are limited but weakly labeled and unlabeled data are available, and its performance is compared with those of the baseline and top-ranked models from both challenges.