SpecAugment for Sound Event Detection in Domestic Environments using Ensemble of Convolutional Recurrent Neural Networks

Wootaek Lim
In this paper, we present a method to detect sound events in domestic environments using a small amount of weakly labeled data, a large amount of unlabeled data, and strongly labeled synthetic data, as proposed in Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge. To solve the problem, we use a convolutional recurrent neural network (CRNN) composed of stacked convolutional neural networks and bi-directional gated recurrent units. Moreover, we propose various methods such as SpecAugment…
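As a rough illustration of the SpecAugment idea named in the abstract, the sketch below applies random frequency and time masks to a spectrogram. It covers the masking steps only and omits the time-warping step of the original SpecAugment method. The function name `spec_augment`, its parameters, and the default mask widths are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def spec_augment(spec, num_freq_masks=1, max_freq_width=8,
                 num_time_masks=1, max_time_width=20, rng=None):
    """Return a masked copy of `spec`, shaped (n_mels, n_frames).

    Hypothetical helper: mask counts and widths are illustrative defaults,
    not the settings used by the paper's author.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()  # leave the caller's array untouched
    n_mels, n_frames = out.shape

    # Frequency masking: zero out a random band of consecutive mel bins.
    for _ in range(num_freq_masks):
        width = min(int(rng.integers(0, max_freq_width + 1)), n_mels)
        start = int(rng.integers(0, n_mels - width + 1))
        out[start:start + width, :] = 0.0

    # Time masking: zero out a random span of consecutive frames.
    for _ in range(num_time_masks):
        width = min(int(rng.integers(0, max_time_width + 1)), n_frames)
        start = int(rng.integers(0, n_frames - width + 1))
        out[:, start:start + width] = 0.0

    return out


# Usage: mask a dummy 64-bin, 500-frame log-mel spectrogram.
dummy = np.ones((64, 500))
augmented = spec_augment(dummy, rng=np.random.default_rng(0))
```

Masking is applied to a fresh copy on every call, so a training loop can draw a different mask each time the same clip is seen.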


Comparative Assessment of Data Augmentation for Semi-Supervised Polyphonic Sound Event Detection
This work proposes a CRNN system that exploits unlabeled data through semi-supervised learning based on the “Mean teacher” method, combined with data augmentation to overcome the limited size of the training dataset and further improve performance.
On Open-Set Classification with L3-Net Embeddings for Machine Listening Applications
  • Kevin Wilkinghoff
  • Computer Science
    2020 28th European Signal Processing Conference (EUSIPCO)
  • 2021
This work presents a neural network that combines all L3-Net embeddings belonging to one recording into a single vector using an x-vector mechanism, together with an open-set classification system based on it.


Weakly labeled semi-supervised sound event detection using CRNN with inception module
Applying the proposed method to weakly labeled semi-supervised sound event detection verified that the proposed system performs better than the DCASE 2018 baseline system.
Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis
The paper introduces the Domestic Environment Sound Event Detection (DESED) dataset, which mixes part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year, and describes the latter in more detail.
Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge
The emergence of deep learning as the most popular classification method is observed, replacing the traditional approaches based on Gaussian mixture models and support vector machines.
Joint Acoustic and Class Inference for Weakly Supervised Sound Event Detection
This work presents a hybrid approach that combines acoustic-driven event boundary detection with supervised label inference using a deep neural network. It leverages the benefits of both unsupervised and supervised methodologies and takes advantage of large amounts of unlabeled data, making it well suited to large-scale weakly labeled event detection.
Sound Event Detection from Partially Annotated Data: Trends and Challenges
A detailed analysis of the impact of the time segmentation, the event classification and the methods used to exploit unlabeled data on the final performance of sound event detection systems is proposed.
Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments
This paper presents DCASE 2018 task 4, which evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries) and explores the possibility of exploiting a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance.
DCASE 2018 Challenge baseline with convolutional neural networks
DCASE 2018 has five tasks: 1) acoustic scene classification, 2) general-purpose audio tagging, 3) bird audio detection, 4) weakly labeled semi-supervised sound event detection, and 5) multi-channel audio tagging. The Python baseline source code contains implementations of convolutional neural networks, including AlexNetish and VGGish, networks originating from computer vision.
A mean-teacher model with a context-gating convolutional neural network (CNN) and a recurrent neural network (RNN) is proposed to maximize the use of the unlabeled in-domain dataset.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
The SINS Database for Detection of Daily Activities in a Home Environment Using an Acoustic Sensor Network
This work introduces a database recorded in one home over a period of one week using an acoustic sensor network. It contains activities performed in a spontaneous manner and recorded as a continuous stream.