Learning Environmental Sounds with Multi-scale Convolutional Neural Network

@article{Zhu2018LearningES,
  title={Learning Environmental Sounds with Multi-scale Convolutional Neural Network},
  author={Boqing Zhu and Changjian Wang and Feng Liu and Jin Lei and Zengquan Lu and Yuxing Peng},
  journal={2018 International Joint Conference on Neural Networks (IJCNN)},
  year={2018},
  pages={1--8}
}
Deep learning has dramatically improved the performance of sound recognition. However, learning acoustic models directly from the raw waveform remains challenging. Current waveform-based models generally use time-domain convolutional layers to extract features, but the features extracted by filters of a single size are insufficient for building discriminative representations of audio. In this paper, we propose a multi-scale convolution operation, which can get better audio representation by improving… 
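The core idea the abstract describes — applying time-domain filters of several sizes to the raw waveform and combining their responses — can be sketched minimally in plain numpy. This is an illustrative assumption of the mechanism, not the paper's actual architecture: function names are hypothetical, a single random filter stands in for each learned filter bank, and global max-pooling stands in for the paper's aggregation.

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    """Valid 1-D convolution (cross-correlation) of signal x with a kernel."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(out_len)])

def multi_scale_features(x, kernel_sizes=(8, 32, 128)):
    """Run filters of several sizes over the raw waveform and pool each
    response to a single value so the scales can be concatenated."""
    feats = []
    for k in kernel_sizes:
        kernel = np.random.randn(k) / np.sqrt(k)  # stand-in for a learned filter
        y = conv1d(x, kernel, stride=k // 2)      # stride grows with filter size
        feats.append(y.max())                     # global max-pool per scale
    return np.array(feats)

waveform = np.random.randn(16000)  # 1 s of synthetic audio at 16 kHz
print(multi_scale_features(waveform).shape)  # (3,)
```

Short filters respond to fine temporal detail while long filters capture coarser envelope structure; concatenating the pooled responses is what makes the representation "multi-scale" rather than tied to one filter size.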


Environment Sound Classification Using a Two-Stream CNN Based on Decision-Level Fusion
The proposed TSCNN-DS model achieves a classification accuracy of 97.2%, the highest reported accuracy on the UrbanSound8K dataset compared to existing models.
Learning discriminative and robust time-frequency representations for environmental sound classification
A new method called the time-frequency enhancement block (TFBlock) is proposed, in which temporal attention and frequency attention enhance the features from relevant frames and frequency bands; it improves classification performance and also exhibits robustness to noise.
Environmental Sound Classification with Parallel Temporal-Spectral Attention
A novel parallel temporal-spectral attention mechanism for CNNs is proposed to learn discriminative sound representations; it enhances temporal and spectral features by capturing the importance of different time frames and frequency bands.
Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features
Results demonstrate that the proposed method is highly effective in classification tasks by employing multi-temporal resolution and multi-level features, and it outperforms previous methods that only account for single-level features.
Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification
A convolutional recurrent neural network model is proposed to learn spectro-temporal features and temporal correlations, and is extended with a frame-level attention mechanism to learn discriminative feature representations for environmental sound classification.
Learning Frame Level Attention for Environmental Sound Classification
A convolutional recurrent neural network model with a frame-level attention mechanism is proposed to learn discriminative feature representations for environmental sound classification, achieving state-of-the-art or competitive classification accuracy with lower computational complexity.
Cross-scale Attention Model for Acoustic Event Classification
A cross-scale attention (CSA) model that explicitly integrates features from different scales to form the final representation is proposed, and it can effectively improve the performance of current state-of-the-art deep learning algorithms.
Feature Pyramid Attention based Residual Neural Network for Environmental Sound Classification
An end-to-end framework, the feature pyramid attention network (FPAM), focusing on abstracting the semantically relevant features for ESC is presented; visualization of attention maps on the spectrograms shows that FPAM focuses on semantically relevant regions while neglecting noise.

References

SHOWING 1-10 OF 34 REFERENCES
Learning environmental sounds with end-to-end convolutional neural network
  • Yuji Tokozume, T. Harada
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
This paper proposes a novel end-to-end ESC system using a convolutional neural network (CNN) and achieves a 6.5% improvement in classification accuracy over the state-of-the-art logmel-CNN with static and delta log-mel features, simply by combining the system with logmel-CNN.
Dilated convolution neural network with LeakyReLU for environmental sound classification
A dilated CNN-based ESC (D-CNN-ESC) system is proposed in which dilated filters and the LeakyReLU activation function increase the receptive field of the convolution layers to incorporate more contextual information; it outperforms the state-of-the-art ESC results obtained by a very deep CNN-ESC system on the UrbanSound8K dataset.
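The receptive-field effect that motivates the D-CNN-ESC summary above can be illustrated with a minimal numpy sketch. This is a generic illustration of dilated 1-D convolution with LeakyReLU, not the cited system: the function names are hypothetical and the filter weights are random.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Valid 1-D convolution where kernel taps are spaced `dilation` apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field of the kernel
    out_len = len(x) - span + 1
    return np.array([sum(kernel[j] * x[i + j * dilation] for j in range(k))
                     for i in range(out_len)])

def leaky_relu(x, alpha=0.01):
    """LeakyReLU keeps a small gradient for negative inputs."""
    return np.where(x > 0, x, alpha * x)

x = np.random.randn(100)
w = np.random.randn(3)
for d in (1, 2, 4):
    y = leaky_relu(dilated_conv1d(x, w, dilation=d))
    # the receptive field grows with dilation while the kernel stays 3 taps wide
    print("dilation", d, "receptive field", (len(w) - 1) * d + 1, "output length", len(y))
```

The point of dilation is that stacking such layers widens the context each output sees exponentially without adding parameters, which is what lets a waveform model incorporate more contextual information cheaply.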
Very deep convolutional neural networks for raw waveforms
This work proposes very deep convolutional neural networks that directly use time-domain waveforms as inputs and are efficient to optimize over the very long sequences necessary for processing acoustic waveforms.
Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms
The experiments show how deep architectures with sample-level filters improve accuracy in music auto-tagging, providing results comparable to previous state-of-the-art performances on the MagnaTagATune dataset and Million Song Dataset.
Environmental sound classification with convolutional neural networks
  • Karol J. Piczak
  • 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2015
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.
Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection
This paper proposes a novel approach to VAD that tackles feature and model selection jointly, and shows that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially in noisy environments.
Learning the speech front-end with raw waveform CLDNNs
It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.
Novel TEO-based Gammatone features for environmental sound classification
A modified Gammatone filterbank with the Teager Energy Operator (TEO) is used with two classifiers, namely a Gaussian Mixture Model (GMM) using cepstral features and a Convolutional Neural Network (CNN) using spectral features, for the environmental sound classification (ESC) task.
Detection and Classification of Acoustic Scenes and Events
The state of the art in automatically classifying audio scenes, and in automatically detecting and classifying audio events, is reported on.
ImageNet classification with deep convolutional neural networks
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes, employing a recently developed regularization method called "dropout" that proved to be very effective.