A comparison of Deep Learning methods for environmental sound detection

Juncheng Billy Li, Wei Dai, Florian Metze, Shuhui Qu, Samarjit Das
2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Environmental sound detection is a challenging application of machine learning because of the noisy nature of the signal and the small amount of (labeled) data that is typically available. This work therefore compares several state-of-the-art Deep Learning models on the task and data of the IEEE Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge, classifying sounds into one of fifteen common indoor and outdoor acoustic scenes, such as bus, cafe…

Deep Learning Based Audio Scene Classification

This work aims to develop a Deep Neural Network (DNN) based system that detects real-life environments by analyzing their sound data. Log Mel band features are used to represent the characteristics of the input audio scenes.

Acoustic scene classification: from a hybrid classifier to deep learning

Two approaches for the acoustic scene classification task were investigated and a combination of features in the time and frequency domain and a hybrid Support Vector Machines Hidden Markov Model classifier was used to achieve an average accuracy over 4 folds.

Background Sound Classification in Speech Audio Segments

This work prepares a new dataset, YBSS-200, from YouTube videos in which each sample contains a distinct background sound and an accompanying foreground human voice, and presents a convolutional neural network based transfer learning approach that uses a VGG-like network to classify the context of such acoustic signals.

Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)

The proposed SED system is compared against the state-of-the-art mono-channel method on the development subset of the TUT sound events detection 2016 database, and the use of spatial and harmonic features is shown to improve the performance of SED.

An Analysis of Audio Classification Techniques using Deep Learning Architectures

The steps of the experiments performed with the newly designed CF Model and CF Clean Model, in both CNN and RNN variants, are shown, and the results are compared with existing models such as DCNN and PiczakCNN.

Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events

This study indicates that recognizing acoustic scenes by identifying distinct sound events is effective and paves the way for future studies that combine this strategy with previous ones.


This work proposes a deep learning framework for Acoustic Scene Classification (ASC), targeting DCASE2019 task 1A, which uses a combination of three types of spectrograms: Gammatone, log-Mel and Constant Q Transform.

Environment Sound Classification using Multiple Feature Channels and Deep Convolutional Neural Networks

To the best of the authors' knowledge, this is the first time that a single environment sound classification model has achieved state-of-the-art results on all three datasets, and by a considerable margin over the previous models.

A Re-trained Model Based On Multi-kernel Convolutional Neural Network for Acoustic Scene Classification

This paper proposes a deep learning framework applied for Acoustic Scene Classification (ASC), which identifies recording location. In general, we apply three types of spectrograms: Gammatone (GAM),



Acoustic Scene Recognition with Deep Neural Networks (DCASE challenge 2016)

This work aims to discriminatively characterize sound in 15 common indoor and outdoor acoustic scenes by classifying audio recordings from the ongoing IEEE challenge on Detection and Classification of Acoustic Scenes and Events, and finds that deep learning models compare favorably to traditional models.

Polyphonic sound event detection using multi label deep neural networks

Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi-label classification in this work, and the proposed method improves the accuracy by 19 percentage points overall.
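
As a rough illustration of what multi-label classification means for polyphonic audio, the sketch below decodes each frame independently with one sigmoid per class, so that several events can be active at once. It is not the cited paper's implementation: the class names, logit values, and the 0.5 threshold are assumptions chosen only for the example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def multi_label_decode(frame_logits, class_names, threshold=0.5):
    """Return, for each frame, every class whose probability reaches the threshold.

    Unlike softmax decoding, each class is scored independently, so a frame
    may have zero, one, or several active classes.
    """
    decoded = []
    for logits in frame_logits:
        probs = [sigmoid(z) for z in logits]
        active = [name for name, p in zip(class_names, probs) if p >= threshold]
        decoded.append(active)
    return decoded


# Hypothetical classes and network outputs (one row of logits per frame).
class_names = ["speech", "car", "footsteps"]
frame_logits = [
    [2.0, 1.5, -3.0],    # two overlapping events
    [-1.0, -2.0, -0.5],  # silence: nothing active
    [0.3, -4.0, 2.2],    # two overlapping events
]
events = multi_label_decode(frame_logits, class_names)
```

The threshold trades missed events against false alarms and would normally be tuned on held-out data.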

Convolutional neural networks for acoustic modeling of raw time signal in LVCSR

It is shown that the performance gap between DNNs trained on spliced hand-crafted features and DNNs trained on raw time signal can be strongly reduced by introducing 1D-convolutional layers.
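
To make the idea of a 1D-convolutional layer on a raw time signal concrete, here is a minimal sketch of a single filter slid over a waveform, followed by a ReLU. It only illustrates the operation; the toy signal, the hand-picked kernel, and the function names are assumptions, whereas a real layer would learn many kernels from data.

```python
def conv1d(signal, kernel, stride=1):
    """Valid-mode 1D cross-correlation, the operation used by convolutional layers."""
    k = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


# Toy raw waveform and a hand-picked difference kernel (both hypothetical).
raw = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
diff_kernel = [-1.0, 0.0, 1.0]
features = relu(conv1d(raw, diff_kernel))
```

With a kernel of length 3 and stride 1, the 8-sample input yields 6 output values; larger strides shorten the output and act as downsampling.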

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations

A method that bypasses the supervised construction of class models is presented, which learns the components as a non-negative dictionary in a coupled matrix factorization problem, where the spectral representation and the class activity annotation of the audio signal share the activation matrix.


This report describes the 4 submissions for Task 1 (Audio scene classification) of the DCASE-2016 challenge of the CP-JKU team and proposes a novel i-vector extraction scheme for ASC using both left and right audio channels and a Deep Convolutional Neural Network architecture trained on spectrograms of audio excerpts in end-to-end fashion.

Deep Speech: Scaling up end-to-end speech recognition

Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.


The use of Recurrence Quantification Analysis (RQA) features is explored for the scene classification task of the IEEE AASP Challenge for Detection and Classification of Acoustic Scenes and Events, and these features improve accuracy when using a standard SVM classifier.

Speech recognition with deep recurrent neural networks

This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
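
For readers unfamiliar with the technique, the following sketch shows the training-time batch normalization transform: each feature is normalized with the mini-batch mean and variance, then scaled and shifted by the parameters gamma and beta. It is a simplified illustration that omits the running statistics used at inference time; the example batch and the parameter values are assumptions.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of samples, each a list of feature values.
    gamma, beta: per-feature scale and shift (learned in a real network).
    """
    n = len(batch)
    dim = len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        column = [row[d] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # biased batch variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][d] = gamma[d] * (column[i] - mean) * inv_std + beta[d]
    return out


# Hypothetical mini-batch: 3 samples, 2 features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With gamma set to 1 and beta to 0, both output features have approximately zero mean and unit variance across the batch, regardless of their original scale.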