Corpus ID: 39821732

Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

@article{Huzaifah2017ComparisonOT,
  title={Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks},
  author={Muhammad Huzaifah},
  journal={ArXiv},
  year={2017},
  volume={abs/1706.07156}
}
Recent successful applications of convolutional neural networks (CNNs) to audio classification and speech recognition have motivated the search for better input representations for more efficient training. […] Additionally, we observe that the optimal window size during transformation is dependent on the characteristics of the audio signal and, architecturally, 2D convolution yielded better results in most cases compared to 1D.
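The abstract's key result concerns the STFT window size: a short window gives finer time resolution but coarser frequency resolution, and vice versa. A minimal NumPy sketch of this trade-off (the helper `stft_mag` is illustrative, not from the paper):

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude STFT: Hann-windowed frames of win_len samples, hop-sample stride."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft along each frame -> (n_frames, win_len // 2 + 1) time-frequency matrix
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 8000
t = np.arange(sr) / sr                       # 1 second of audio
x = np.sin(2 * np.pi * 440 * t)              # 440 Hz test tone

short = stft_mag(x, win_len=256, hop=128)    # many frames, few frequency bins
long_ = stft_mag(x, win_len=1024, hop=512)   # few frames, many frequency bins
print(short.shape, long_.shape)              # (61, 129) vs (14, 513)
```

The two output shapes make the trade-off concrete: the 256-sample window yields 61 time frames but only 129 frequency bins, while the 1024-sample window yields 14 frames with 513 bins. The paper's finding is that which end of this trade-off is optimal depends on the audio signal's characteristics.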

Citations

Environment Sound Classification using Multiple Feature Channels and Deep Convolutional Neural Networks
To the best of the knowledge, this is the first time that a single environment sound classification model is able to achieve state-of-the-art results on all three datasets and by a considerable margin over the previous models.
Spectrogram Transformers for Audio Classification
Spectrogram Transformers are a group of transformer-based models for audio classification that outperform the state-of-the-art methods on the ESC-50 dataset without a pre-training stage and show great efficiency compared with other leading methods.
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)
The proposed SED system is compared against the state-of-the-art mono-channel method on the development subset of the TUT sound events detection 2016 database, and the usage of spatial and harmonic features is shown to improve the performance of SED.
An Ensemble Stacked Convolutional Neural Network Model for Environmental Event Sound Recognition
A novel stacked CNN model with multiple convolutional layers of decreasing filter sizes is proposed to improve the performance of CNN models with either log-mel feature input or raw waveform input to build the ensemble DS-CNN model for ESC.
Multi-stream Network With Temporal Attention For Environmental Sound Classification
This work introduces a multi-stream convolutional neural network with temporal attention that addresses problems of environmental sound classification systems and achieves new state-of-the-art performance without any changes in network architecture or front-end preprocessing, thus demonstrating better generalizability.
Slice Bispectrogram Analysis-Based Classification of Environmental Sounds Using Convolutional Neural Network
Certain systems can function well only if they recognize the sound environment as humans do. In this research, we focus on sound classification by adopting a convolutional neural network […]
LD-CNN: A Lightweight Dilated Convolutional Neural Network for Environmental Sound Classification
A lightweight dilated CNN (termed LD-CNN) ESC system is developed, motivated by the finding that features of environmental sounds have a weak absolute-locality property, so a global sum operation can be applied to compress the feature map.
Polyphonic Sound Event Detection Using Capsule Neural Network on Multi-Type-Multi-Scale Time-Frequency Representation
A novel PSED framework that incorporates Multi-Type-Multi-Scale TFRs, which can reveal acoustics patterns in a complementary manner and achieves a 7% reduction in error rate compared with the state-of-the-art solutions on the TUT-SED 2016 dataset.
A Light-Weight Deep Convolutional Neural Network for Speech Emotion Recognition using Mel-Spectrograms
A lightweight deep convolutional neural network architecture is proposed, which utilizes only a partial component of AlexNet with log-mel spectrograms as input and can achieve a recognition rate comparable with the state of the art.
Recognition of Urban Sound Events Using Deep Context-Aware Feature Extractors and Handcrafted Features
The main contribution of this work is the demonstration that transferring audio contextual knowledge using CNNs as feature extractors can significantly improve the performance of the audio classifier, without the need for CNN training.

References

Showing 1–10 of 33 references
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
Environmental sound classification with convolutional neural networks
  • Karol J. Piczak
  • Computer Science
    2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2015
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.
Deep Convolutional Neural Networks for Large-scale Speech Tasks
Deep convolutional neural networks for LVCSR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram […]
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning […]
Spectrogram Image Feature for Sound Event Classification in Mismatched Conditions
A novel feature extraction method for sound event classification, based on the visual signature extracted from the sound's time-frequency representation, which shows a significant improvement over other methods in mismatched conditions, without the need for noise reduction.
Environmental sound recognition: A survey
This survey offers a qualitative and elucidatory overview of recent developments in environmental sound recognition, in three parts: i) basic environmental sound processing schemes, ii) stationary ESR techniques, and iii) non-stationary ESR techniques.
Audio analysis for surveillance applications
The proposed hybrid solution is capable of detecting new kinds of suspicious audio events that occur as outliers against a background of usual activity and adaptively learns a Gaussian mixture model to model the background sounds and updates the model incrementally as new audio data arrives.
Convoluted Feelings: Convolutional and Recurrent Nets for Detecting Emotion from Audio Data
A Convolutional Neural Network model to extract features from audio data and applies this model to the task of classifying emotion from speech data, achieving an accuracy of 50% for 7-class classification using CNN-extracted features on 500ms audio patches.