Hodge and Podge: Hybrid Supervised Sound Event Detection with Multi-Hot MixMatch and Composition Consistence Training

@article{Shi2021HodgeAP,
  title={Hodge and Podge: Hybrid Supervised Sound Event Detection with Multi-Hot MixMatch and Composition Consistence Training},
  author={Ziqiang Shi and Liu Liu and Huibin Lin and Rujie Liu},
  journal={2020 28th European Signal Processing Conference (EUSIPCO)},
  year={2021},
  pages={1-5}
}
  • Ziqiang Shi, Liu Liu, Huibin Lin, Rujie Liu
  • Published 13 February 2020
  • Computer Science
  • 2020 28th European Signal Processing Conference (EUSIPCO)
In this paper, we propose a method called Hodge and Podge for sound event detection. We demonstrate Hodge and Podge on the dataset of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge Task 4. This task aims to predict the presence or absence and the onset and offset times of sound events in home environments. Sound event detection is challenging due to the lack of large-scale, real, strongly labeled data. Recently, deep semi-supervised learning (SSL) has proven to…
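The title names two ingredients, multi-hot MixMatch and composition consistency training, which adapt mixup-style augmentation to clips where several sound events can be active at once. As a rough illustration only (the paper's exact recipe is not reproduced here), the sketch below mixes two feature clips with a MixMatch-style Beta-sampled weight and forms a multi-hot target as the clipped union of the two label vectors; the function name and the union rule are assumptions for illustration.

```python
import numpy as np

def multi_hot_mixup(x1, y1, x2, y2, alpha=0.75, rng=np.random):
    """Hypothetical multi-hot mixup sketch, not the authors' exact method.

    x1, x2: feature clips (e.g. log-mel spectrograms) of the same shape.
    y1, y2: multi-hot label vectors in {0, 1}^num_classes.
    """
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)          # MixMatch biases the mix toward the first sample
    x = lam * x1 + (1.0 - lam) * x2    # interpolate the inputs
    # Assumed multi-hot rule: a mixed clip contains every event present in
    # either source clip, so take the clipped sum (union) of the labels.
    y = np.clip(y1 + y2, 0.0, 1.0)
    return x, y
```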


References

Showing 1-10 of 22 references
HODGEPODGE: Sound Event Detection Based on Ensemble of Semi-Supervised Learning Methods
In this paper, we present a method called HODGEPODGE for large-scale detection of sound events using weakly labeled, synthetic, and unlabeled data proposed in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge Task 4.
Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis
TLDR
The paper introduces the Domestic Environment Sound Event Detection (DESED) dataset, which mixes part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year and described in more detail.
Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments
TLDR
This paper presents DCASE 2018 task 4, which evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries), and explores the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance.
DCASE 2018 Challenge baseline with convolutional neural networks
TLDR
The DCASE 2018 Challenge has five tasks: 1) acoustic scene classification, 2) general-purpose audio tagging, 3) bird audio detection, 4) weakly-labeled semi-supervised sound event detection, and 5) multi-channel audio tagging; the Python baseline source code contains implementations of convolutional neural networks, including AlexNetish and VGGish, networks originating from computer vision.
DCASE 2018 Challenge Surrey cross-task convolutional neural network baseline
TLDR
A cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a “CNN Baseline” system implementing CNNs with 4 and 8 layers, originating from AlexNet and VGG in computer vision.
Mean Teacher Convolution System for DCASE 2018 Task 4
TLDR
A mean-teacher model with a context-gating convolutional neural network (CNN) and a recurrent neural network (RNN) is proposed to maximize the use of the unlabeled in-domain dataset.
Squeeze-and-Excitation Networks
TLDR
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked to form SENet architectures that generalise extremely effectively across different datasets.
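As a quick sketch of what an SE block computes (the shapes and the bottleneck reduction ratio here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Minimal Squeeze-and-Excitation sketch for one (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is the
    reduction ratio; both are illustrative stand-ins for learned FC layers.
    """
    z = x.mean(axis=(1, 2))                # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)            # excitation: FC + ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))    # FC + sigmoid -> per-channel gates in (0, 1)
    return x * s[:, None, None]            # recalibrate channel responses
```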
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
TLDR
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks, but it becomes unwieldy when learning on large datasets, so Mean Teacher, a method that averages model weights instead of label predictions, is proposed.
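The core of Mean Teacher is an exponential moving average (EMA) of the student's weights. A minimal sketch, assuming parameters are stored as a dict of arrays (the dict layout and the decay value are assumptions):

```python
def mean_teacher_update(teacher, student, decay=0.999):
    """EMA weight averaging: teacher <- decay * teacher + (1 - decay) * student.

    teacher, student: dicts mapping parameter names to numeric arrays.
    """
    for name, s_param in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * s_param
```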
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set, a large-scale dataset of manually annotated audio events, is described; the dataset endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
Interpolation Consistency Training for Semi-Supervised Learning
TLDR
Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training deep neural networks in the semi-supervised learning paradigm, achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets.
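A minimal sketch of the ICT consistency term, assuming `student` and `teacher` are callables returning class probabilities and that squared error is used as the consistency cost (both choices are assumptions for illustration):

```python
import numpy as np

def ict_consistency(student, teacher, u1, u2, alpha=1.0, rng=np.random):
    """Interpolation Consistency Training sketch on two unlabeled batches.

    Penalizes the student for predicting, on a mixed input, anything other
    than the same mix of the teacher's predictions on the unmixed inputs.
    """
    lam = rng.beta(alpha, alpha)
    mixed = lam * u1 + (1.0 - lam) * u2
    target = lam * teacher(u1) + (1.0 - lam) * teacher(u2)  # treated as constant (no grad)
    return np.mean((student(mixed) - target) ** 2)
```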