Impact of Acoustic Event Tagging on Scene Classification in a Multi-Task Learning Framework

Rahil S. Parikh, Harshavardhan Sundar, Ming Sun, Chao Wang, Spyros Matsoukas
Acoustic events are sounds with well-defined spectro-temporal characteristics which can be associated with the physical objects generating them. Acoustic scenes are collections of such acoustic events in no specific temporal order. Given this natural linkage between events and scenes, a common belief is that the ability to classify events must help in the classification of scenes. This has led to several efforts attempting to do well on Acoustic Event Tagging (AET) and Acoustic Scene Classification (ASC)…
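The multi-task setup the abstract refers to is typically a shared encoder with one head per task. The following is a minimal illustrative sketch, not the paper's exact model: the network dimensions, the tanh encoder, and the loss weight `lam` are all assumptions made for the example. Event tagging is treated as multi-label (sigmoid + binary cross-entropy) and scene classification as single-label (softmax + cross-entropy), with the two losses summed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: input features, shared hidden size,
# number of event tags, number of scene classes.
n_feats, n_hidden, n_events, n_scenes = 64, 32, 10, 5

W_shared = rng.normal(size=(n_feats, n_hidden)) * 0.1
W_event = rng.normal(size=(n_hidden, n_events)) * 0.1
W_scene = rng.normal(size=(n_hidden, n_scenes)) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    h = np.tanh(x @ W_shared)            # shared representation for both tasks
    event_probs = sigmoid(h @ W_event)   # multi-label acoustic event tags
    scene_probs = softmax(h @ W_scene)   # single-label scene posterior
    return event_probs, scene_probs

def mtl_loss(event_probs, event_y, scene_probs, scene_y, lam=0.5):
    # Binary cross-entropy for tagging plus cross-entropy for scenes,
    # combined with an assumed weighting factor `lam`.
    bce = -np.mean(event_y * np.log(event_probs + 1e-9)
                   + (1 - event_y) * np.log(1 - event_probs + 1e-9))
    ce = -np.mean(np.log(scene_probs[np.arange(len(scene_y)), scene_y] + 1e-9))
    return ce + lam * bce

x = rng.normal(size=(4, n_feats))                     # a batch of 4 clips
event_y = rng.integers(0, 2, size=(4, n_events))      # weak event labels
scene_y = rng.integers(0, n_scenes, size=4)           # scene labels
ev, sc = forward(x)
loss = mtl_loss(ev, event_y, sc, scene_y)
```

Training the shared encoder on both objectives is what lets event information influence the scene representation, which is precisely the effect the paper sets out to measure.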




Joint Analysis of Acoustic Events and Scenes Based on Multitask Learning

Experimental results indicate that the proposed multitask learning for joint analysis of acoustic events and scenes improves the performance of acoustic event detection by 10.66 percentage points in terms of the F-score, compared with a conventional method based on a convolutional recurrent neural network.

Cross-task pre-training for acoustic scene classification

This work explores a cross-task pre-training mechanism that utilizes acoustic event information extracted from a pre-trained model to optimize the ASC task, and shows that cross-task pre-training can significantly improve the performance of ASC.

A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling

A network architecture designed mainly for audio tagging that can also be used for weakly supervised acoustic event detection (AED); it consists of a modified DenseNet as the feature extractor and a global average pooling (GAP) layer, and predicts frame-level labels at inference time.
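The GAP idea described above can be sketched in a few lines. This is a simplified illustration under assumed shapes, not the paper's implementation: a backbone emits frame-level class logits, averaging them over time yields a clip-level prediction that can be trained with only clip-level (weak) labels, and the per-frame probabilities remain available for localizing events at inference.

```python
import numpy as np

def gap_aggregate(frame_logits):
    """frame_logits: (n_frames, n_classes) array of per-frame class logits.

    Returns clip-level multi-label probabilities via global average
    pooling over the time axis followed by a sigmoid.
    """
    clip_logits = frame_logits.mean(axis=0)      # global average pooling over time
    return 1.0 / (1.0 + np.exp(-clip_logits))    # multi-label sigmoid

rng = np.random.default_rng(1)
frame_logits = rng.normal(size=(100, 10))        # e.g. 100 frames, 10 event classes
clip_probs = gap_aggregate(frame_logits)         # one probability per event class
```

Because the pooling is a simple mean, gradients from a clip-level loss distribute evenly across frames, which is what makes training with weak labels possible without frame annotations.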

DCASENET: An Integrated Pretrained Deep Neural Network for Detecting and Classifying Acoustic Scenes and Events

The aim is to build an integrated system that can serve as a pretrained model for the three tasks; the proposed architecture, DcaseNet, can either be used directly for any of the tasks while providing suitable results, or be fine-tuned to improve the performance of one task.

A Review of Deep Learning Based Methods for Acoustic Scene Classification

This article summarizes and groups existing approaches for data preparation, i.e., feature representations, feature pre-processing, and data augmentation, and for data modeling, i.e., …

Acoustic Scene Classification Using Audio Tagging

A novel scheme for acoustic scene classification that adopts an audio tagging system inspired by the human perception mechanism, showing effectiveness on the DCASE 2019 Task 1-A dataset.

A multi-device dataset for urban acoustic scene classification

The acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task are introduced, and the performance of a baseline system in the task is evaluated.

A joint detection-classification model for audio tagging of weakly labelled data

This work proposes a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously and shows that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%.

Towards joint sound scene and polyphonic sound event recognition

It is shown that by taking a joint approach, learning is more efficient, and whilst improvements are still needed for sound event detection (SED), results are robust in a dataset where the sample distribution is skewed towards sound scenes.

Exploring deep vision models for acoustic scene classification

This report evaluates the application of deep vision models, namely VGG and ResNet, to general audio recognition by training several of these architectures on the Task 1 dataset to perform acoustic scene classification, and explores two ensemble methods to aggregate the different model outputs.