• Corpus ID: 236171243

Improving Polyphonic Sound Event Detection on Multichannel Recordings with the Sørensen-Dice Coefficient Loss and Transfer Learning

Karn Nichakarn Watcharasupat, Thi Ngoc Tho Nguyen, Ngoc Khanh Nguyen, Zhen Jian Lee, Douglas L. Jones, Woonseng Gan
The Sørensen–Dice coefficient has recently seen rising popularity as a loss function (also known as Dice loss) due to its robustness in tasks where the number of negative samples significantly exceeds that of positive samples, such as semantic segmentation, natural language processing, and sound event detection. Conventional training of polyphonic sound event detection systems with binary cross-entropy loss often results in suboptimal detection performance as the training is often overwhelmed by…
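
To illustrate the imbalance problem the abstract describes, here is a minimal sketch comparing a common "soft" Dice loss with mean binary cross-entropy on a toy frame sequence. The function names, the smoothing term `eps`, and the toy data are assumptions made for illustration; the paper's exact formulation is not given in the text above and may differ.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-7):
    """Return 1 - soft Dice coefficient for flat lists of probabilities and 0/1 labels.

    This is one common form of Dice loss (an assumption, not necessarily
    the formulation used in the paper).
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def bce_loss(probs, targets, eps=1e-7):
    """Return mean binary cross-entropy, clipping probabilities away from 0 and 1."""
    losses = []
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Toy example: 100 frames, of which only 2 contain an active event.
targets = [1, 1] + [0] * 98

# A degenerate model that predicts "inactive" for every frame.
all_inactive = [0.01] * 100

dice = soft_dice_loss(all_inactive, targets)
bce = bce_loss(all_inactive, targets)
print(f"Dice loss: {dice:.3f}, BCE loss: {bce:.3f}")
```

In this toy case the cross-entropy is small because the many inactive frames dominate the average, whereas the Dice loss stays close to 1 because the two active frames are missed entirely.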

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

This work combines convolutional and recurrent neural networks in a convolutional recurrent neural network (CRNN), applies it to a polyphonic sound event detection task, and observes a considerable improvement on four different datasets consisting of everyday sound events.

Impact of Sound Duration and Inactive Frames on Sound Event Detection Performance

This paper investigates the impact of sound duration and inactive frames on SED performance by introducing four loss functions: simple reweighting loss, inverse frequency loss, asymmetric focal loss, and focal batch Tversky loss.


The proposed method employs Conformer, a convolution-augmented Transformer that exploits local features of audio data more effectively using CNNs while capturing global features with the Transformer, together with semi-supervised learning and data augmentation.

Metrics for Polyphonic Sound Event Detection

This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources.

Sound event detection using spatial features and convolutional recurrent neural network

This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection and shows that, instead of concatenating the features of each channel into a single feature vector, the network learns sound events in multichannel audio better when they are presented as separate layers of a volume.

A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection

To investigate the individual and combined effects of ambient noise, interferers, and reverberation, the baseline is evaluated on versions of the dataset that exclude or include combinations of these factors; the results indicate that by far the most detrimental effects are caused by directional interferers.

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.

Dice Loss for Data-imbalanced NLP Tasks

This paper proposes to use Dice loss in place of the standard cross-entropy objective for data-imbalanced NLP tasks, based on the Sørensen–Dice coefficient or Tversky index, which attaches similar importance to false positives and false negatives and is more immune to the data-imbalance issue.

TUT database for acoustic scene classification and sound event detection

The recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and an event detection baseline system using mel-frequency cepstral coefficients and Gaussian mixture models are presented.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.