• Corpus ID: 39821732

Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

Author: Muhammad Huzaifah
Recent successful applications of convolutional neural networks (CNNs) to audio classification and speech recognition have motivated the search for better input representations for more efficient training. Additionally, we observe that the optimal window size during transformation is dependent on the characteristics of the audio signal and that, architecturally, 2D convolution yielded better results in most cases compared to 1D.
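A minimal sketch of the kind of comparison described above, assuming a synthetic test tone and SciPy's STFT (the sample rate, tone frequency, and window sizes are illustrative, not the paper's actual experimental settings):

```python
import numpy as np
from scipy import signal

fs = 22050                        # sample rate in Hz (illustrative)
t = np.arange(fs) / fs            # one second of audio
x = np.sin(2 * np.pi * 440 * t)   # synthetic 440 Hz test tone

# Compute the STFT with two different window sizes. The paper's
# observation is that the best window length depends on the signal:
# longer windows trade time resolution for frequency resolution.
for nperseg in (512, 2048):
    freqs, frames, Zxx = signal.stft(x, fs=fs, nperseg=nperseg)
    print(nperseg, Zxx.shape)     # (frequency bins, time frames)
```

The one-sided STFT returns `nperseg // 2 + 1` frequency bins, so the 2048-sample window gives roughly four times the frequency resolution of the 512-sample one at the cost of coarser time framing.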


Environmental Sound Classification with Parallel Temporal-Spectral Attention
A novel parallel temporal-spectral attention mechanism for CNN to learn discriminative sound representations is proposed, which enhances the temporal and spectral features by capturing the importance of different time frames and frequency bands.
Audio representation for environmental sound classification using convolutional neural networks
A convolutional neural network (CNN) training framework is described and implemented, and the model is shown to be relatively robust against wind noise: accuracy remains above 60% until the SNR between signal and wind noise approaches 9 dB.
Environment Sound Classification using Multiple Feature Channels and Deep Convolutional Neural Networks
To the best of the authors' knowledge, this is the first time that a single environmental sound classification model achieves state-of-the-art results on all three datasets, and by a considerable margin over previous models.
Environment Sound Classification Using Multiple Feature Channels and Attention Based Deep Convolutional Neural Network
This is the first time that a single environmental sound classification model achieves state-of-the-art results on all three datasets, and the accuracy of the proposed model exceeds human accuracy.
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)
The proposed SED system is compared against the state-of-the-art mono-channel method on the development subset of the TUT sound events detection 2016 database, and the use of spatial and harmonic features is shown to improve SED performance.
An Ensemble Stacked Convolutional Neural Network Model for Environmental Event Sound Recognition
A novel stacked CNN model with multiple convolutional layers of decreasing filter sizes is proposed to improve the performance of CNN models with either log-mel feature input or raw waveform input to build the ensemble DS-CNN model for ESC.
Multi-stream Network With Temporal Attention For Environmental Sound Classification
This work introduces a multi-stream convolutional neural network with temporal attention that addresses problems of environmental sound classification systems and achieves new state-of-the-art performance without any changes in network architecture or front-end preprocessing, thus demonstrating better generalizability.
A Real-Time Convolutional Neural Network Based Speech Enhancement for Hearing Impaired Listeners Using Smartphone
A Speech Enhancement (SE) technique based on multi-objective learning convolutional neural network to improve the overall quality of speech perceived by Hearing Aid (HA) users is presented.
CNN and Sound Processing-Based Audio Classifier for Alarm Sound Detection
Artificial neural networks (ANNs) have evolved through many stages over the last three decades, with many researchers contributing to this challenging field. With the power of mathematics, complex problems can be solved.
Polyphonic Sound Event Detection Using Capsule Neural Network on Multi-Type-Multi-Scale Time-Frequency Representation
This work proposes a novel PSED framework that incorporates Multi-Type-Multi-Scale TFRs and applies a novel approach to adaptively fuse different models and TFRs symbiotically, so that overall performance is significantly improved.


Time–Frequency Matrix Feature Extraction and Classification of Environmental Audio Signals
The results of the numerical simulation support the effectiveness of the proposed approach for environmental audio classification, with over 10% accuracy improvement compared to MFCC features.
Environmental Sound Recognition With Time–Frequency Audio Features
An empirical feature analysis for audio environment characterization is performed and a matching pursuit algorithm is proposed to use to obtain effective time-frequency features to yield higher recognition accuracy for environmental sounds.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
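The augmentation idea can be sketched in a hedged way: the snippet below uses two simple waveform-level augmentations (a random circular time shift and additive white noise at a target SNR) rather than the paper's actual pitch-shift and time-stretch pipeline, and all names and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def time_shift(x, max_shift):
    """Circularly shift the waveform by a random sample offset."""
    k = rng.integers(-max_shift, max_shift + 1)
    return np.roll(x, k)

def add_noise(x, snr_db):
    """Mix in white noise at a given signal-to-noise ratio (dB)."""
    sig_power = np.mean(x ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise

# Apply both augmentations to a synthetic test tone.
x = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
aug = add_noise(time_shift(x, 1000), snr_db=20)
print(aug.shape)  # same length as the input waveform
```

Each augmented copy can then be transformed to a spectrogram and added to the training set, which is the "augmented training set" the summary above refers to.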
Environmental sound classification with convolutional neural networks
  • Karol J. Piczak
  • Computer Science
    2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2015
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.
Convolutional Neural Networks for Speech Recognition
It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Deep Convolutional Neural Networks for Large-scale Speech Tasks
Deep convolutional neural networks for LVCSR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural-network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing the speech-class confusion induced by such invariance.
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data.