Learning environmental sounds with end-to-end convolutional neural network

@article{Tokozume2017LearningES,
  title={Learning environmental sounds with end-to-end convolutional neural network},
  author={Yuji Tokozume and Tatsuya Harada},
  journal={2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2017},
  pages={2721--2725}
}
  • Published 3 March 2017
Environmental sound classification (ESC) is usually conducted based on handcrafted features such as the log-mel feature. […] Key result: a 6.5% improvement in classification accuracy over the state-of-the-art logmel-CNN with static and delta log-mel features, achieved simply by combining the proposed system with logmel-CNN.
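The paper's central idea is to learn features directly from the raw waveform with 1-D convolutions instead of handcrafted log-mel features. A minimal NumPy sketch of such a raw-waveform front-end (random, untrained kernels stand in for learned filters; the kernel count, length, stride, and class count are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels, stride):
    """Valid 1-D convolution of a mono waveform with a bank of kernels."""
    k = kernels.shape[1]
    n_out = (x.shape[0] - k) // stride + 1
    windows = np.stack([x[i * stride:i * stride + k] for i in range(n_out)])
    return windows @ kernels.T  # shape: (n_out, n_kernels)

# Toy raw-waveform input: one second of audio at 16 kHz.
waveform = rng.standard_normal(16000).astype(np.float32)

# Stand-in for a learned filterbank: 40 kernels of length 64, stride 16.
kernels = rng.standard_normal((40, 64)).astype(np.float32)
features = np.maximum(conv1d(waveform, kernels, stride=16), 0.0)  # ReLU

pooled = features.max(axis=0)                     # global max pooling over time
logits = pooled @ rng.standard_normal((40, 50))   # linear head, 50 ESC classes
pred = int(np.argmax(logits))
print(features.shape, pred)
```

In the paper, the convolutional filters are trained end-to-end by backpropagation; the sketch only shows the shape of the computation from waveform to class logits.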

Citations of this paper

Learning Environmental Sounds with Multi-scale Convolutional Neural Network
TLDR
A novel end-to-end network called WaveMsNet is proposed, based on multi-scale convolution operations and a two-phase method; it obtains better audio representations by improving the frequency resolution and learning filters across all frequency regions.
Environment Sound Event Classification With a Two-Stream Convolutional Neural Network
TLDR
This paper proposes a two-stream convolutional neural network (CNN) based on a raw-audio CNN (RACNN) and a log-mel CNN (LMCNN), with a pre-emphasis module constructed first to process the raw audio signal.
Environment Sound Classification Using a Two-Stream CNN Based on Decision-Level Fusion
TLDR
The proposed TSCNN-DS model achieves a classification accuracy of 97.2%, the highest taxonomic accuracy on the UrbanSound8K dataset compared to existing models.
An Ensemble Stacked Convolutional Neural Network Model for Environmental Event Sound Recognition
TLDR
A novel stacked CNN model with multiple convolutional layers of decreasing filter sizes is proposed to improve the performance of CNN models with either log-mel or raw-waveform input, and the two are combined to build the ensemble DS-CNN model for ESC.
Environment Sound Classification using Multiple Feature Channels and Deep Convolutional Neural Networks
TLDR
To the best of our knowledge, this is the first time that a single environment sound classification model achieves state-of-the-art results on all three datasets, and by a considerable margin over previous models.
End-To-End Auditory Object Recognition Via Inception Nucleus
TLDR
A novel end-to-end deep neural network maps raw waveform inputs to sound class labels and includes an "inception nucleus" that optimizes the size of convolutional filters on the fly, dramatically reducing engineering effort.
Deep Convolutional Neural Network with Mixup for Environmental Sound Classification
TLDR
A novel deep convolutional neural network is proposed for environmental sound classification (ESC) tasks that uses stacked convolutional and pooling layers to extract high-level feature representations from spectrogram-like features.
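The mixup augmentation mentioned in the title above trains on convex combinations of pairs of examples and their labels. A minimal sketch (the Beta-distributed mixing coefficient follows the original mixup formulation; shapes, class count, and `alpha` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two training examples and their one-hot labels with weight lam."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Two toy spectrogram-like feature maps (40 mel bands x 100 frames)
# and their one-hot labels over 10 classes.
x1, x2 = rng.standard_normal((2, 40, 100))
y1, y2 = np.eye(10)[3], np.eye(10)[7]

x_mix, y_mix = mixup(x1, y1, x2, y2)
print(x_mix.shape, float(y_mix.sum()))
```

Because the same coefficient mixes both inputs and labels, the blended label still sums to 1 and acts as a soft training target.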
Multi-channel Convolutional Neural Networks with Multi-level Feature Fusion for Environmental Sound Classification
TLDR
Inspired by VGG networks, the proposed method outperforms state-of-the-art end-to-end methods for environmental sound classification in terms of classification accuracy.
Environment Sound Classification Using Multiple Feature Channels and Attention Based Deep Convolutional Neural Network
TLDR
This is the first time that a single environment sound classification model is able to achieve state-of-the-art results on all three datasets, and the accuracy achieved by the proposed model exceeds human accuracy.
...

References

SHOWING 1-10 OF 19 REFERENCES
Speech acoustic modeling from raw multichannel waveforms
TLDR
A convolutional neural network–deep neural network (CNN-DNN) acoustic model takes raw multichannel waveforms as input, learns a similar feature representation through supervised training, and outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
Acoustic modeling with deep neural networks using raw time signal for LVCSR
TLDR
Inspired by the multi-resolution analysis layer learned automatically from raw time-signal input, the DNN is trained on a combination of multiple short-term features, illustrating how the DNN can learn from the small differences between MFCC, PLP, and Gammatone features.
Environmental sound classification with convolutional neural networks
  • Karol J. Piczak
  • 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2015
TLDR
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.
Learning the speech front-end with raw waveform CLDNNs
TLDR
It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.
Convolutional Neural Networks for Speech Recognition
TLDR
It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning
TLDR
This work introduces a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains, and demonstrates that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained.
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
ESC: Dataset for Environmental Sound Classification
TLDR
A new annotated collection of 2000 short clips comprising 50 classes of various common sound events, and an abundant unified compilation of 250000 unlabeled auditory excerpts extracted from recordings available through the Freesound project are presented.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
...