HODGEPODGE: Sound Event Detection Based on Ensemble of Semi-Supervised Learning Methods

@inproceedings{Shi2019HODGEPODGESE,
  title={HODGEPODGE: Sound Event Detection Based on Ensemble of Semi-Supervised Learning Methods},
  author={Ziqiang Shi and Liu Liu and Huibin Lin and Rujie Liu and Anyan Shi},
  booktitle={DCASE},
  year={2019}
}
In this paper, we present a method called HODGEPODGE for the large-scale detection of sound events using the weakly labeled, synthetic, and unlabeled data provided in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge Task 4: Sound event detection in domestic environments. To perform this task, we adopted a convolutional recurrent neural network (CRNN) as our backbone network. In order to deal with a small amount of tagged data and a large amount… 
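As a rough illustration of the CRNN backbone named in the abstract, the sketch below stacks a small CNN front-end over log-mel features, a bidirectional GRU over time, and a frame-level sigmoid classifier. It is a minimal sketch only: the layer sizes, pooling factors, the max-pooling used to derive clip-level (weak) predictions, and the 10-class output are assumptions for illustration, not the authors' exact configuration.

# Minimal CRNN sketch for frame-level sound event detection (illustrative only).
# Input: log-mel spectrogram of shape (batch, 1, time, n_mels). Layer sizes,
# pooling, and the 10-class output are assumptions, not the HODGEPODGE config.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        # CNN front-end: local time-frequency features, pooling only along frequency.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        # Bidirectional GRU models temporal context across frames.
        self.rnn = nn.GRU(64 * (n_mels // 16), 64, batch_first=True, bidirectional=True)
        self.frame_fc = nn.Linear(128, n_classes)  # frame-level (strong) predictions

    def forward(self, x):                              # x: (batch, 1, time, n_mels)
        h = self.cnn(x)                                # (batch, 64, time, n_mels // 16)
        h = h.permute(0, 2, 1, 3).flatten(2)           # (batch, time, 64 * n_mels // 16)
        h, _ = self.rnn(h)                             # (batch, time, 128)
        frame_prob = torch.sigmoid(self.frame_fc(h))   # per-frame event probabilities
        clip_prob = frame_prob.max(dim=1).values       # weak (clip-level) prediction via pooling
        return frame_prob, clip_prob

Frame-level outputs yield event onsets and offsets after thresholding, while the pooled clip-level outputs can be trained against the weak tags.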

Citations

Hodge and Podge: Hybrid Supervised Sound Event Detection with Multi-Hot MixMatch and Composition Consistence Training
TLDR
This work explores how to extend deep semi-supervised learning (SSL) to sound event detection, resulting in a new state-of-the-art method called Hodge and Podge, and proposes multi-hot MixMatch and composition consistency training with temporal-frequency augmentation.
Sound Event Detection by Consistency Training and Pseudo-Labeling With Feature-Pyramid Convolutional Recurrent Neural Networks
TLDR
This work proposes FP-CRNN, a convolutional recurrent neural network (CRNN) containing feature-pyramid (FP) components that leverage temporal information through features at different scales, in order to exploit a large amount of unlabeled in-domain data efficiently.
Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis
TLDR
The paper introduces the Domestic Environment Sound Event Detection (DESED) dataset, which mixes part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year and described in more detail.
Non-Negative Matrix Factorization-Convolutional Neural Network (NMF-CNN) For Sound Event Detection
TLDR
A deep learning model is proposed that integrates non-negative matrix factorization (NMF) with a convolutional neural network (CNN), using NMF to provide approximate strong labels for the weakly labeled data.
PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation
TLDR
PSLA is presented, a collection of training techniques that noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation, together with their design choices; it achieves a new state-of-the-art mean average precision on AudioSet.
Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization
TLDR
This work is the first to jointly consider the audio and video modalities for supervised TAL, and experimentally shows that its schemes consistently improve performance over state-of-the-art video-only TAL approaches.
Sound Event Detection in Synthetic Domestic Environments
TLDR
A comparative analysis of the performance of state-of-the-art sound event detection systems based on the results of task 4 of the DCASE 2019 challenge, where submitted systems were evaluated on a series of synthetic soundscapes that allow us to carefully control for different soundscape characteristics.
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation
TLDR
PSLA is presented, a collection of model-agnostic training techniques that noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation.
Soft-Median Choice: An Automatic Feature Smoothing Method for Sound Event Detection
TLDR
A novel automatic feature-smoothing algorithm based on Soft-Median Choice is proposed that obtains significantly better scores than the reference algorithms.

References

Showing 1-10 of 16 references
Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis
TLDR
The paper introduces the Domestic Environment Sound Event Detection (DESED) dataset, which mixes part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year and described in more detail.
Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments
TLDR
This paper presents DCASE 2018 Task 4, which evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries) and explores the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance.
DCASE 2018 Challenge baseline with convolutional neural networks
TLDR
DCASE 2018 comprises five tasks: 1) acoustic scene classification, 2) general-purpose audio tagging, 3) bird audio detection, 4) weakly-labeled semi-supervised sound event detection, and 5) multi-channel audio tagging; the Python baseline source code contains implementations of convolutional neural networks, including AlexNetish and VGGish networks originating from computer vision.
Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge
TLDR
The emergence of deep learning as the most popular classification method is observed, replacing the traditional approaches based on Gaussian mixture models and support vector machines.
DCASE 2018 Challenge Surrey cross-task convolutional neural network baseline
TLDR
A cross-task baseline system for all five tasks based on a convolutional neural network (CNN): a “CNN Baseline” system implementing CNNs with 4 and 8 layers, originating from AlexNet and VGG in computer vision.
Mean Teacher Convolution System for DCASE 2018 Task 4
TLDR
A mean-teacher model with a context-gating convolutional neural network (CNN) and recurrent neural network (RNN) is proposed to maximize the use of the unlabeled in-domain dataset.
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
TLDR
This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset.
Interpolation Consistency Training for Semi-Supervised Learning
TLDR
Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training deep neural networks in the semi-supervised learning paradigm, achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets (see the sketch after the Mean Teacher reference below).
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
TLDR
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks, but it becomes unwieldy when learning from large datasets, so Mean Teacher, a method that averages model weights instead of label predictions, is proposed.
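Interpolation Consistency Training and Mean Teacher, both listed in the references above, are consistency-based semi-supervised methods of the kind the HODGEPODGE abstract refers to. The sketch below is a hedged, minimal illustration of the two ideas in PyTorch: an exponential-moving-average (EMA) teacher whose weights track the student, a teacher/student agreement term, and an ICT-style interpolation-consistency term on unlabeled clips. The models are assumed to return a single prediction tensor, and the decay rate, loss form, and fixed mixing coefficient are illustrative assumptions, not the exact recipe of any referenced system.

# Hedged sketch of Mean Teacher weight averaging and ICT-style interpolation
# consistency. `student` and `teacher` are assumed to be identically shaped
# modules (e.g. teacher = copy.deepcopy(student)) mapping a feature batch to
# a single prediction tensor; all hyperparameters here are illustrative.
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.999):
    # Mean Teacher: average model weights instead of label predictions.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def consistency_losses(student, teacher, x_unlabeled, lam=0.3):
    # Teacher predictions serve as pseudo-targets and carry no gradient.
    with torch.no_grad():
        teacher_pred = teacher(x_unlabeled)
    student_pred = student(x_unlabeled)
    mt_loss = F.mse_loss(student_pred, teacher_pred)  # teacher/student agreement

    # ICT: predictions on mixed inputs should match the same mix of the
    # teacher's predictions on the original inputs. (ICT samples the mixing
    # coefficient from a Beta distribution; a fixed value is used for brevity.)
    perm = torch.randperm(x_unlabeled.size(0))
    x_mix = lam * x_unlabeled + (1.0 - lam) * x_unlabeled[perm]
    target_mix = lam * teacher_pred + (1.0 - lam) * teacher_pred[perm]
    ict_loss = F.mse_loss(student(x_mix), target_mix)
    return mt_loss, ict_loss

In a training loop, the supervised loss on labeled clips would be combined with a ramped-up weighting of these consistency terms, and ema_update(teacher, student) would be called after each optimizer step.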