Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition

@inproceedings{Takahashi2016DeepCN,
  title={Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition},
  author={Naoya Takahashi and Michael Gygli and Beat Pfister and Luc Van Gool},
  booktitle={INTERSPEECH},
  year={2016}
}
We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate the long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous works, this enables training audio event…
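As a rough illustration of what a "large input field" means in practice, a log-mel spectrogram can be sliced into long, overlapping context windows that each span many frames; the window and hop sizes below are illustrative choices, not values from the paper:

```python
import numpy as np

def long_context_windows(spec, field_frames=400, hop=200):
    """Slice a (time, freq) spectrogram into overlapping windows that each
    cover a long temporal context, as a large CNN input field would.

    Returns an array of shape (n_windows, field_frames, n_freq_bins)."""
    windows = []
    for start in range(0, spec.shape[0] - field_frames + 1, hop):
        windows.append(spec[start:start + field_frames])
    return np.stack(windows)
```

Each window then becomes one 2D input to the network, so the model can see seconds of context rather than a single short frame.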

Citations

Temporal Transformer Networks for Acoustic Scene Classification
TLDR
A novel temporal transformer module, composed of a Fourier transform layer for feature maps and a learnable feature reduction layer, is proposed to allow temporal manipulation of data in neural networks; it can be inserted into existing convolutional neural network (CNN) and long short-term memory (LSTM) models.
SwishNet: A Fast Convolutional Neural Network for Speech, Music and Noise Classification and Segmentation
TLDR
This work proposes SwishNet, a fast and lightweight 1D convolutional neural network that operates on MFCC features and is suitable for the front end of an audio processing pipeline, and shows that the network's performance can be improved by distilling knowledge from a 2D CNN pretrained on ImageNet.
DNN and CNN with Weighted and Multi-task Loss Functions for Audio Event Detection
TLDR
This report presents the proposed audio event detection system, based on convolutional neural networks and deep neural networks coupled with novel weighted and multi-task loss functions and state-of-the-art phase-aware signal enhancement, submitted to the DCASE 2017 challenge.
Constrained Learned Feature Extraction for Acoustic Scene Classification
  • Teng Zhang, Ji Wu
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
A new learnable module, the simulated Fourier transform module, is described, which allows deep neural networks to implement the discrete Fourier transform operation 8x faster on a graphics processing unit (GPU).
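The operation such a module approximates is just a fixed linear map: the discrete Fourier transform can be written as one matrix multiply, which is the form a network layer can learn or simulate. A minimal numpy sketch (not the paper's module):

```python
import numpy as np

def dft_matrix(n):
    """Dense DFT matrix: F[k, t] = exp(-2*pi*i * k * t / n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

def dft(x):
    """DFT of a 1-D signal as a single matrix multiply."""
    return dft_matrix(len(x)) @ x
```

Expressing the transform this way makes clear why it can sit inside a network as an ordinary (here fixed-weight) linear layer.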
Acoustic Scene Classification by Combining Autoencoder-based Dimensionality Reduction and Convolutional Neural Networks
TLDR
This work presents a distributed sensor-server system for acoustic scene classification in urban environments based on deep convolutional neural networks (CNN), and discusses which confusions among particular classes can be ascribed to particular sound event types, which are present in multiple acoustic scene classes.
Comparative Assessment of Data Augmentation for Semi-Supervised Polyphonic Sound Event Detection
TLDR
This work proposes a CRNN system exploiting unlabeled data with semi-supervised learning based on the "Mean teacher" method, in combination with data augmentation, to overcome the limited size of the training dataset and to further improve performance.
Weighted and Multi-Task Loss for Rare Audio Event Detection
TLDR
Two loss functions tailored for rare audio event detection in audio streams are presented: the weighted loss is designed to tackle the common issue of imbalanced data in background/foreground classification, while the multi-task loss enables the networks to simultaneously model the class distribution and the temporal structures of the target events for recognition.
Exploring CNN-Based Architectures for Multimodal Salient Event Detection in Videos
TLDR
Comparisons over the COGNIMUSE database, consisting of movies and travel documentaries, provided strong evidence that the CNN-based approach for all modalities, even in this task, manages to outperform the hand-crafted frontend in almost all cases, accomplishing really good average results.
Time Series Data Augmentation for Neural Networks by Time Warping with a Discriminative Teacher
TLDR
Guided warping is proposed: it exploits the element alignment properties of Dynamic Time Warping (DTW) and shapeDTW, a high-level DTW method based on shape descriptors, to deterministically warp sample patterns.
Shuffling and Mixing Data Augmentation for Environmental Sound Classification
TLDR
This paper proposes a data augmentation technique that generates new sound by shuffling and mixing two existing sounds of the same class in the dataset, creating new variations in both the temporal sequence and the density of the sound events.
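The idea can be sketched on raw waveforms: cut two same-class clips into segments, shuffle each segment order, and mix the rearranged clips. The segment count and mixing weight below are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_and_mix(x1, x2, n_segments=4, mix_weight=0.5):
    """Augment by shuffling segments of two same-class clips (equal length)
    and mixing the results into one new clip."""
    def shuffled(x):
        segs = np.array_split(x, n_segments)  # split into segments
        rng.shuffle(segs)                     # randomize segment order
        return np.concatenate(segs)
    return mix_weight * shuffled(x1) + (1 - mix_weight) * shuffled(x2)
```

Because only the segment order changes, the output has the same length and overall energy as a plain mix, but a new temporal arrangement of events.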

References

SHOWING 1-10 OF 42 REFERENCES
Exploiting spectro-temporal locality in deep learning based acoustic event detection
TLDR
Two feature extraction strategies are explored: using multiple-resolution spectrograms simultaneously and analyzing their overall and event-wise influence to combine the results; and using convolutional neural networks (CNNs), a state-of-the-art 2D feature extraction model that exploits local structures, with log-power spectrogram input for AED.
Very deep multilingual convolutional neural networks for LVCSR
TLDR
A very deep convolutional network architecture with up to 14 weight layers and small 3×3 kernels, inspired by the VGG ImageNet 2014 architecture, is introduced, along with multilingual CNNs with multiple untied layers.
Improved audio features for large-scale multimedia event detection
TLDR
While the overall finding is that MFCC features perform best, ANN and LSP features are found to provide complementary information at various levels of temporal resolution.
Audio event classification using deep neural networks
TLDR
It is shown that the DNN has some advantage over other classification methods and that fusion of two methods can produce the best results.
Real-world acoustic event detection
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
TLDR
The proposed CNN architecture is applied to speech recognition within the hybrid NN-HMM framework, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Vocal Tract Length Perturbation (VTLP) improves speech recognition
TLDR
Speech recognition improvements are obtained without increasing the number of training epochs, suggesting that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
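VTLP perturbs the frequency axis of each utterance's spectral representation to simulate different vocal tract lengths. A deliberately simplified sketch using a single linear warp of one magnitude-spectrum frame (real VTLP uses a piecewise-linear warp with a random per-utterance warp factor, typically near 1.0):

```python
import numpy as np

def vtlp_warp(spectrum, alpha=1.1):
    """Linearly warp the frequency axis of a magnitude-spectrum frame by
    factor alpha, resampling by interpolation back to the original bins."""
    n = len(spectrum)
    src = np.clip(np.arange(n) / alpha, 0, n - 1)  # warped source positions
    return np.interp(src, np.arange(n), spectrum)
```

Drawing a fresh `alpha` per training utterance yields cheap, label-preserving variants of the same recording.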
A blind segmentation approach to acoustic event detection based on i-vector
TLDR
A new blind segmentation approach to acoustic event detection (AED) based on i-vectors inspired by block-based automatic image annotation in image retrieval tasks, which shows promising results with an average of 8% absolute gain in F1 over the conventional hidden Markov model based approach.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Bag-of-Audio-Words Approach for Multimedia Event Classification
TLDR
Variations of the BoAW method are explored and results on NIST 2011 multimedia event detection (MED) dataset are presented.