Corpus ID: 220447185

Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection

@article{Takahashi2016DeepCN,
  title={Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection},
  author={Naoya Takahashi and Michael Gygli and Beat Pfister and Luc Van Gool},
  journal={arXiv: Sound},
  year={2016}
}
We propose a novel method for Acoustic Event Detection (AED). In contrast to speech, sounds coming from acoustic events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of a clear sub-word unit. In order to incorporate the long-time frequency structure for AED, we introduce a convolutional neural network (CNN) with a large input field. In contrast to previous works, this enables us to train an audio event… 
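To make the idea in the abstract concrete, here is a minimal PyTorch sketch of a CNN whose input field spans several seconds of a log-mel spectrogram rather than a few frames. The layer widths, input size (64 mel bands × 400 frames) and class count are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch (assumptions, not the paper's exact architecture):
# a small VGG-style CNN over a log-mel spectrogram patch that covers a
# long temporal context, so long-time frequency structure is visible
# to the classifier.
import torch
import torch.nn as nn

class LargeInputFieldCNN(nn.Module):
    def __init__(self, n_mels: int = 64, n_frames: int = 400, n_classes: int = 28):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),   # halve both frequency and time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        flat = 64 * (n_mels // 4) * (n_frames // 4)  # size after two 2x poolings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames) log-mel spectrogram patch
        return self.classifier(self.features(x))

# Example: a 4 s clip at 100 frames per second gives a 64x400 input patch.
logits = LargeInputFieldCNN()(torch.randn(2, 1, 64, 400))
```

The only point being illustrated is the wide temporal input field; depth and small-kernel choices follow the VGG-style conventions of the references listed further below.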
AENet: Learning Deep Audio Features for Video Analysis
TLDR
A convolutional neural network operating on a large temporal input enables an end-to-end audio event detection system; transfer learning experiments show that the model learns generic audio features, similar to the way CNNs learn generic features on vision tasks.
Evaluation of Modulation-MFCC Features and DNN Classification for Acoustic Event Detection
TLDR
Traditional techniques and different deep learning architectures, including convolutional and recurrent models, are evaluated on real-life everyday audio recordings under realistic yet challenging multisource conditions.
Weakly and semi-supervised learning for sound event detection using image pretrained convolutional recurrent neural network, weighted pooling and mean teacher method
TLDR
A sound event detection (SED) method that uses a deep neural network trained on weakly labeled and unlabeled data and outperforms the DCASE2021 Task4 baseline method.
Attention Based CLDNNs for Short-Duration Acoustic Scene Classification
TLDR
This work applies the CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Network) framework to short-duration acoustic scene classification in a unified architecture and achieves higher performance than conventional neural network architectures.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, investigates varying the size of both the training set and the label vocabulary, and finds that analogs of the CNNs used in image classification do well on this audio classification task and that larger training and label sets help up to a point.
Semi-supervised Acoustic Event Detection Based on Tri-training
TLDR
This paper uses an Internet-scale unlabeled dataset with potential domain shift to improve the detection of acoustic events and shows accuracy improvements over both the supervised training baseline and a semi-supervised self-training setup in all pre-defined acoustic event detection tasks.
Compression of Acoustic Event Detection Models With Quantized Distillation
TLDR
This paper presents a simple yet effective compression approach that jointly leverages knowledge distillation and quantization to compress a larger network (teacher model) into a compact network (student model), and shows that the proposed technique not only lowers the error rate of the original compact network by 15% through distillation but also further reduces its model size to a large extent.
Mmdenselstm: An Efficient Combination of Convolutional and Recurrent Neural Networks for Audio Source Separation
TLDR
A novel architecture that integrates long short-term memory (LSTM) at multiple scales with skip connections to efficiently model long-term structures within an audio context is proposed, and yields better results than those obtained using ideal binary masks for a singing voice separation task.
Hierarchical Sound Event Classification
TLDR
A model is proposed that is composed of a preprocessing layer that converts audio to a log-mel spectrogram, a VGG-inspired convolutional neural network that generates an embedding for the spectrogram, a pre-trained VGGish network that generates a separate audio embedding, and finally a series of fully-connected layers that converts these two embeddings into a multi-label classification (a minimal log-mel extraction sketch appears at the end of this list).
Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification
TLDR
This work proposes a convolutional recurrent neural network model to learn spectro-temporal features and temporal correlations, and extends this model with a frame-level attention mechanism to learn discriminative feature representations for environmental sound classification.
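Several of the citing works above, e.g. the hierarchical classifier, take a log-mel spectrogram as input. The following sketch shows one common way to compute it, assuming librosa is available; the sample rate, FFT size, hop length and number of mel bands are placeholder values, not parameters taken from any of these papers.

```python
# Minimal log-mel spectrogram extraction (placeholder parameters).
import librosa
import numpy as np

def log_mel_spectrogram(path: str, sr: int = 16000, n_mels: int = 64,
                        n_fft: int = 1024, hop_length: int = 160) -> np.ndarray:
    """Return an (n_mels, n_frames) log-scaled mel spectrogram."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```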

References

Showing 1-10 of 42 references
Exploiting spectro-temporal locality in deep learning based acoustic event detection
TLDR
Two different feature extraction strategies are explored: using multiple-resolution spectrograms simultaneously and analyzing the overall and event-wise influence to combine the results, and using convolutional neural networks (CNN), a state-of-the-art 2D feature extraction model that exploits local structures, with log-power spectrogram input for AED.
Very deep multilingual convolutional neural networks for LVCSR
TLDR
A very deep convolutional network architecture with up to 14 weight layers and small 3×3 kernels, inspired by the VGG ImageNet 2014 architecture, is introduced, along with multilingual CNNs with multiple untied layers.
Improved audio features for large-scale multimedia event detection
TLDR
While the overall finding is that MFCC features perform best, it is found that ANN as well as LSP features provide complementary information at various levels of temporal resolution.
Audio event classification using deep neural networks
TLDR
It is shown that the DNN has some advantage over other classification methods and that fusion of two methods can produce the best results.
Real-world acoustic event detection
A blind segmentation approach to acoustic event detection based on i-vector
TLDR
A new blind segmentation approach to acoustic event detection (AED) based on i-vectors, inspired by block-based automatic image annotation in image retrieval tasks, shows promising results with an average 8% absolute gain in F1 over the conventional hidden Markov model based approach.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
TLDR
The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Bag-of-Audio-Words Approach for Multimedia Event Classification
TLDR
Variations of the BoAW method are explored and results on the NIST 2011 multimedia event detection (MED) dataset are presented.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Vocal Tract Length Perturbation (VTLP) improves speech recognition
TLDR
Improvements in speech recognition are reported without increasing the number of training epochs, and it is suggested that data transformations should be an important component of training neural networks for speech, especially for data-limited projects (a toy frequency-warping sketch follows this list).
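Since the data augmentation theme of the main paper builds on frequency-warping ideas such as VTLP, here is a toy sketch of warping the frequency axis of a spectrogram by a random factor. It uses a plain linear warp for illustration; the actual VTLP mapping is piecewise linear, and the array shapes and alpha range shown are assumptions.

```python
# Toy frequency-axis warping in the spirit of VTLP (simplified: linear warp).
import numpy as np

def warp_frequency_axis(spec: np.ndarray, alpha: float) -> np.ndarray:
    """spec: (n_freq, n_frames) spectrogram. Returns a warped copy, same shape."""
    n_freq = spec.shape[0]
    bins = np.arange(n_freq)
    # Output bin i takes its value from original position i / alpha;
    # positions past the top bin saturate at the highest original bin.
    sample_at = bins / alpha
    warped = np.empty_like(spec)
    for t in range(spec.shape[1]):
        warped[:, t] = np.interp(sample_at, bins, spec[:, t])
    return warped

# Usage: draw a fresh warp factor near 1.0 for each training example.
augmented = warp_frequency_axis(np.random.rand(64, 400),
                                alpha=np.random.uniform(0.9, 1.1))
```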