Forward-Backward Convolutional Recurrent Neural Networks and Tag-Conditioned Convolutional Neural Networks for Weakly Labeled Semi-supervised Sound Event Detection

@article{Ebbers2021ForwardBackwardCR,
  title={Forward-Backward Convolutional Recurrent Neural Networks and Tag-Conditioned Convolutional Neural Networks for Weakly Labeled Semi-supervised Sound Event Detection},
  author={Janek Ebbers and Reinhold Haeb-Umbach},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.06581}
}
In this paper we present our system for the detection and classification of acoustic scenes and events (DCASE) 2020 Challenge Task 4: Sound event detection and separation in domestic environments. We introduce two new models: the forward-backward convolutional recurrent neural network (FBCRNN) and the tag-conditioned convolutional neural network (CNN). The FBCRNN employs two recurrent neural network (RNN) classifiers sharing the same CNN for preprocessing. With one RNN processing a recording in…
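A minimal sketch may help make the FBCRNN idea concrete: two RNN classifiers share one CNN front end, one reading the recording forward and the other backward, so that at every time step the pair has jointly seen the entire recording. The sketch below assumes PyTorch; layer sizes, the CNN configuration, and the GRU choice are illustrative assumptions, not the authors' exact setup.

    import torch
    import torch.nn as nn

    class FBCRNN(nn.Module):
        def __init__(self, n_mels=64, hidden=128, n_classes=10):
            super().__init__()
            # Shared CNN front end: both RNN classifiers read its output.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
                nn.MaxPool2d((2, 1)),  # pool frequency, keep time resolution
            )
            feat = 16 * (n_mels // 2)
            self.fwd_rnn = nn.GRU(feat, hidden, batch_first=True)
            self.bwd_rnn = nn.GRU(feat, hidden, batch_first=True)
            self.fwd_out = nn.Linear(hidden, n_classes)
            self.bwd_out = nn.Linear(hidden, n_classes)

        def forward(self, x):  # x: (batch, 1, n_mels, time)
            h = self.cnn(x)                      # (batch, 16, n_mels/2, time)
            h = h.flatten(1, 2).transpose(1, 2)  # (batch, time, feat)
            f, _ = self.fwd_rnn(h)               # forward pass over time
            b, _ = self.bwd_rnn(h.flip(1))       # backward pass over time
            b = b.flip(1)
            # At step t the forward branch has seen frames <= t and the
            # backward branch frames >= t, so together they have covered
            # the whole recording.
            return torch.sigmoid(self.fwd_out(f)), torch.sigmoid(self.bwd_out(b))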


Self-Trained Audio Tagging and Sound Event Detection in Domestic Environments
TLDR
This paper uses a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo labeling, followed by tag-conditioned sound event detection (SED) models which are trained using strong pseudo labels provided by the FBCRNN, and introduces a strong label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training.
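The combination of a weak (clip-level) loss with an added strong (frame-level) loss for the synthetic data can be sketched roughly as below; the max-pooling over time and the weighting factor are illustrative assumptions, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def weak_strong_loss(frame_logits, clip_labels, frame_labels=None, alpha=0.5):
        # frame_logits: (batch, time, classes); clip_labels: (batch, classes)
        clip_logits = frame_logits.max(dim=1).values   # pool frames to clip level
        loss = F.binary_cross_entropy_with_logits(clip_logits, clip_labels)
        if frame_labels is not None:  # only synthetic clips carry strong labels
            loss = loss + alpha * F.binary_cross_entropy_with_logits(
                frame_logits, frame_labels)
        return loss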
Sound Event Detection Using Metric Learning and Focal Loss for DCASE
TLDR
The main module in the system, named MLFL, uses metric learning and focal loss, and adopts a weakly-supervised learning framework with an attention-based embedding-level pooling module and the mean-teacher method for semi-supervised learning.
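Focal loss down-weights easy examples so training focuses on hard ones; a standard binary formulation (Lin et al.) looks roughly like this, with gamma and alpha set to common defaults rather than the MLFL system's values.

    import torch
    import torch.nn.functional as F

    def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        # gamma and alpha are common defaults, not necessarily MLFL's.
        p = torch.sigmoid(logits)
        pt = targets * p + (1 - targets) * (1 - p)        # prob. of true class
        at = targets * alpha + (1 - targets) * (1 - alpha)
        bce = F.binary_cross_entropy_with_logits(logits, targets,
                                                 reduction="none")
        return (at * (1 - pt) ** gamma * bce).mean()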
Peer Collaborative Learning for Polyphonic Sound Event Detection
This paper shows that a semi-supervised learning method called peer collaborative learning (PCL) can be applied to polyphonic sound event detection (PSED), one of the tasks in the DCASE challenge.
Couple Learning for semi-supervised sound event detection
TLDR
An effective Couple Learning method is proposed that combines a well-trained model and a Mean Teacher model, improving the Mean Teacher method's performance and reducing the noise impact in the pseudo-labels introduced by detection errors.
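The Mean Teacher component that these Couple Learning papers build on can be sketched as follows: the teacher's weights track an exponential moving average of the student's, and a consistency loss pushes the student toward the teacher's predictions on unlabeled clips. The decay value and the MSE choice are typical defaults, not taken from the paper.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def ema_update(teacher, student, decay=0.999):
        # Teacher weights follow an exponential moving average of the student's.
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

    def consistency_loss(student_logits, teacher_logits):
        # The student is pushed toward the teacher's (detached) predictions;
        # MSE on probabilities is the usual choice.
        return F.mse_loss(torch.sigmoid(student_logits),
                          torch.sigmoid(teacher_logits).detach())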
Adapting Sound Recognition to A New Environment Via Self-Training
TLDR
This paper proposes a self-training based domain adaptation approach which only requires unlabeled data from the target environment, on which a student network is trained, and shows that the student significantly improves recognition performance over the pre-trained teacher without relying on labeled data from the environment the system is deployed in.
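A rough sketch of one self-training step under this setup: a pre-trained teacher pseudo-labels unlabeled target-environment clips, and only confident clips are kept for the student. The confidence threshold and the clip-level tagging interface are illustrative assumptions.

    import torch

    @torch.no_grad()
    def pseudo_label(teacher, clips, threshold=0.9):
        # `threshold` is an illustrative choice; teacher returns clip-level
        # logits of shape (batch, classes).
        probs = torch.sigmoid(teacher(clips))
        confident = ((probs > threshold) | (probs < 1 - threshold)).all(dim=-1)
        labels = (probs > 0.5).float()
        return clips[confident], labels[confident]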
Couple Learning: Mean Teacher with PLG Model Improves the Results of Sound Event Detection
TLDR
An effective Couple Learning method is proposed that combines a well-trained model and a Mean Teacher model, reducing the noise impact in the pseudo-labels introduced by detection errors and increasing the amount of strongly and weakly labeled data to improve the Mean Teacher method's performance.
Couple Learning: Mean Teacher method with pseudo-labels improves semi-supervised deep learning results
TLDR
Experimental results on Task 4 of the DCASE2020 challenge demonstrate the superiority of the proposed Couple Learning method, which achieves a 39.18% F1-score on the public evaluation set, outperforming the baseline system's 37.12% by a significant margin.

References

Showing 1-10 of 25 references
Convolutional Recurrent Neural Network and Data Augmentation for Audio Tagging with Noisy Labels and Minimal Supervision
TLDR
This paper proposes a model consisting of a convolutional front end using log-mel-energies as input features, a recurrent neural network sequence encoder and a fully connected classifier network outputting an activity probability for each of the 80 considered event classes.
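The log-mel front end such CRNN systems start from can be produced, for example, with torchaudio; the sample rate, FFT size, and mel count below are typical values, not necessarily the paper's configuration.

    import torch
    import torchaudio

    # Illustrative log-mel feature extraction for a CRNN tagging system.
    melspec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=1024, hop_length=320, n_mels=64)
    waveform = torch.randn(1, 16000)              # 1 s of dummy audio
    logmel = torch.log(melspec(waveform) + 1e-6)  # (1, 64, frames)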
Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network
TLDR
A stacked convolutional and recurrent neural network with two prediction layers in sequence, one for the strong labels followed by one for the weak labels, is proposed; it achieves the best error rate of 0.84 for strong labels and an F-score of 43.3% for weak labels on the unseen test split.
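The two sequential prediction layers can be pictured roughly as below: a frame-level (strong) head whose outputs a second layer aggregates into a clip-level (weak) prediction. Layer sizes and the mean aggregation are illustrative assumptions.

    import torch
    import torch.nn as nn

    class StrongWeakHead(nn.Module):
        def __init__(self, feat=128, n_classes=10):
            super().__init__()
            self.strong = nn.Linear(feat, n_classes)     # per-frame activity
            self.weak = nn.Linear(n_classes, n_classes)  # clip-level tags

        def forward(self, h):  # h: (batch, time, feat) from the CRNN
            strong = torch.sigmoid(self.strong(h))       # (batch, time, classes)
            weak = torch.sigmoid(self.weak(strong.mean(dim=1)))
            return strong, weak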
Learning to Recognize Transient Sound Events using Attentional Supervision
TLDR
This paper presents an attempt to learn a neural network model that recognizes more than 500 different sound events from the audio part of user-generated videos (UGV), establishing a new state-of-the-art for the DCASE17 and AudioSet data sets.
Weakly-Supervised Sound Event Detection with Self-Attention
TLDR
A novel sound event detection method that incorporates a self-attention mechanism of the Transformer for a weakly-supervised learning scenario and introduces a special tag token into the input sequence for weak label prediction, which enables the aggregation of the whole sequence information.
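The tag-token idea can be sketched as a learnable vector prepended to the frame sequence before the self-attention encoder; its encoded state predicts the weak labels while the remaining positions predict frame-level activity. All dimensions below are illustrative.

    import torch
    import torch.nn as nn

    class TagTokenTransformer(nn.Module):
        def __init__(self, feat=128, n_classes=10, n_layers=2, n_heads=4):
            super().__init__()
            self.tag_token = nn.Parameter(torch.zeros(1, 1, feat))
            layer = nn.TransformerEncoderLayer(feat, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.out = nn.Linear(feat, n_classes)

        def forward(self, x):  # x: (batch, time, feat)
            tok = self.tag_token.expand(x.size(0), -1, -1)
            h = self.encoder(torch.cat([tok, x], dim=1))
            weak = torch.sigmoid(self.out(h[:, 0]))      # tag token -> weak labels
            strong = torch.sigmoid(self.out(h[:, 1:]))   # frames -> strong labels
            return strong, weak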
Multi-level Attention Model for Weakly Supervised Audio Classification
TLDR
A multi-level attention model is proposed which consists of multiple attention modules applied on the intermediate neural network layers; it achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single attention model and the Google baseline system.
Guided Learning Convolution System for DCASE 2019 Task 4
TLDR
The system submitted to DCASE2019 Task 4: sound event detection (SED) in domestic environments uses a convolutional neural network with an embedding-level attention pooling module and achieves the best performance compared to those of the other participants.
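An embedding-level attention pooling module of this kind is commonly written as per-frame attention weights that aggregate frame-level scores into a clip-level prediction; the sketch below is one common formulation, not necessarily the submitted system's.

    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        def __init__(self, feat=128, n_classes=10):
            super().__init__()
            self.att = nn.Linear(feat, n_classes)
            self.cla = nn.Linear(feat, n_classes)

        def forward(self, h):  # h: (batch, time, feat)
            w = torch.softmax(self.att(h), dim=1)   # attention over time
            frame = torch.sigmoid(self.cla(h))      # frame-level scores
            clip = (w * frame).sum(dim=1)           # weighted aggregation
            return clip, frame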
Adaptive Pooling Operators for Weakly Labeled Sound Event Detection
TLDR
This paper treats SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality, and develops a family of adaptive pooling operators, referred to as autopool, which smoothly interpolate between common pooling operators and automatically adapt to the characteristics of the sound sources in question.
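Auto-pool admits a compact implementation: a learned per-class parameter alpha interpolates between mean pooling (alpha = 0), softmax-weighted pooling (alpha = 1), and max pooling (alpha -> infinity) of the frame-level probabilities.

    import torch
    import torch.nn as nn

    class AutoPool(nn.Module):
        def __init__(self, n_classes=10):
            super().__init__()
            self.alpha = nn.Parameter(torch.ones(n_classes))

        def forward(self, p):  # p: (batch, time, classes), frame probabilities
            w = torch.softmax(self.alpha * p, dim=1)  # learned weights over time
            return (w * p).sum(dim=1)                 # (batch, classes)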
A Closer Look at Weak Label Learning for Audio Events
TLDR
This work describes a CNN based approach for weakly supervised training of audio events and describes important characteristics, which naturally arise inweakly supervised learning of sound events, and shows how these aspects of weak labels affect the generalization of models.
Convolution-Augmented Transformer for Semi-Supervised Sound Event Detection (Technical Report)
TLDR
This model employs conformer blocks, which combine the self-attention and depth-wise convolution networks, to efficiently capture the global and local context information of an audio feature sequence.
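A much-reduced sketch of a conformer block: multi-head self-attention captures global context and a depthwise convolution captures local context, each behind a residual connection. The macaron feed-forward modules and the exact ordering of the real conformer are omitted here.

    import torch
    import torch.nn as nn

    class MiniConformerBlock(nn.Module):
        def __init__(self, feat=128, n_heads=4, kernel=7):
            super().__init__()
            self.norm1 = nn.LayerNorm(feat)
            self.attn = nn.MultiheadAttention(feat, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(feat)
            self.dwconv = nn.Conv1d(feat, feat, kernel, padding=kernel // 2,
                                    groups=feat)  # depthwise: one filter/channel

        def forward(self, x):  # x: (batch, time, feat)
            h = self.norm1(x)
            a, _ = self.attn(h, h, h)
            x = x + a                                # global context
            c = self.dwconv(self.norm2(x).transpose(1, 2)).transpose(1, 2)
            return x + c                             # local context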
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.