• Corpus ID: 9130552

Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data

@article{Kumar2017DeepCF,
  title={Deep CNN Framework for Audio Event Recognition using Weakly Labeled Web Data},
  author={Anurag Kumar and Bhiksha Raj},
  journal={ArXiv},
  year={2017},
  volume={abs/1707.02530}
}
The development of audio event recognition models requires labeled training data, which are generally hard to obtain. One promising source of recordings of audio events is the large amount of multimedia data on the web. In particular, if the audio content analysis must itself be performed on web audio, it is important to train the recognizers themselves from such data. Training from these web data, however, poses several challenges, the most important being the availability of labels: labels…

Citations

Data-efficient weakly supervised learning for low-resource audio event detection using deep learning
TLDR
A data-efficient training scheme for a stacked convolutional and recurrent neural network is proposed in a multiple-instance learning setting, for which a new loss function is introduced that leads to improved training compared to the usual approaches for weakly supervised learning.
Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes
TLDR
This work describes a convolutional neural network (CNN) based framework for sound event detection and classification using weakly labeled audio data and proposes methods to learn representations using this model which can be effectively used for solving the target task.
Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
TLDR
A deep convolutional neural network model called DSNet based on densely connected convolution networks (DenseNets) and squeeze-and-excitation networks (SENets) for weakly supervised training of AED is introduced, which alleviates the vanishing-gradient problem and strengthens feature propagation and models interdependencies between channels.
Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network
TLDR
A stacked convolutional and recurrent neural network with two prediction layers in sequence, one for strong labels followed by one for weak labels, is proposed; it achieves the best error rate of 0.84 for strong labels and an F-score of 43.3% for weak labels on the unseen test split.
A Closer Look at Weak Label Learning for Audio Events
TLDR
This work describes a CNN-based approach for weakly supervised training of audio events, identifies important characteristics that naturally arise in weakly supervised learning of sound events, and shows how these aspects of weak labels affect the generalization of models.
Deep Learning for Audio Event Detection and Tagging on Low-Resource Datasets
TLDR
This paper proposes factorising the final task of audio transcription into multiple intermediate tasks in order to improve the training performance when dealing with this kind of low-resource datasets.
Sound Event Detection by Pseudo-Labeling in Weakly Labeled Dataset
TLDR
A more efficient model is constructed by employing a gated linear unit (GLU) and dilated convolution to address the de-emphasis of important features and the limited receptive field, and a pseudo-label-based learning scheme for classifying target and unknown contents is proposed: a 'noise label' and 'noise loss' are added so that unknown contents can be separated as much as possible through the noise label.
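The GLU gating and dilated convolution named in that summary can be sketched minimally in NumPy; this is a toy 1-D illustration with hypothetical shapes, not the paper's implementation:

```python
import numpy as np

def dilated_conv1d(x, w, dilation=2):
    """1-D dilated convolution with 'valid' padding.
    x: (time,) input sequence; w: (k,) kernel.
    Dilation spaces the kernel taps apart, widening the
    receptive field without adding parameters."""
    k = len(w)
    span = (k - 1) * dilation + 1           # receptive field of the kernel
    taps = np.arange(k) * dilation          # dilated tap positions
    return np.array([x[t + taps] @ w for t in range(len(x) - span + 1)])

def glu(a, b):
    """Gated linear unit: one linear path is gated elementwise
    by a sigmoid of the other, letting the network suppress or
    emphasize features per position."""
    return a * (1.0 / (1.0 + np.exp(-b)))
```

With a kernel of length 2 and dilation 2, each output mixes samples two steps apart, which is how stacked dilated layers cover long clips cheaply.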
Sound Event Classification and Detection with Weakly Labeled Data
TLDR
Two methods for joint SEC and SED using weakly labeled data are proposed: a fully convolutional network (FCN) and a novel method that combines a convolutional neural network with an attention layer (CNNatt).
Class-aware Self-Attention for Audio Event Recognition
TLDR
A novel class-aware self-attention mechanism with attention factor sharing is proposed to generate discriminative clip-level features for audio event recognition; it is able to learn new audio events from a few training examples effectively and efficiently without disturbing previously learned audio events.
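Class-wise attention pooling of frame-level features into clip-level features, in the spirit of that summary, might look like the following NumPy sketch; the shared attention factors are simplified here to one attention vector per class, and all names are hypothetical:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def class_attention_pool(frames, attn_w):
    """Pool frame-level features into one clip-level feature per class.
    frames: (T, D) frame features; attn_w: (C, D) per-class attention
    vector (a simplification of class-aware factor sharing).
    Returns (C, D): for each class, a weighted sum of frames where the
    weights reflect that class's notion of which frames matter."""
    scores = frames @ attn_w.T            # (T, C) frame relevance per class
    alpha = softmax(scores, axis=0)       # normalize over time
    return alpha.T @ frames               # (C, D) attention-weighted sums
```

With all-zero attention vectors the weights are uniform and the pool reduces to the frame mean, which is a useful sanity check.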
Deep Learning for Audio Transcription on Low-Resource Datasets
TLDR
This paper proposes factorising the final task of audio transcription into multiple intermediate tasks in order to improve the training performance when dealing with this kind of low-resource datasets.

References

Showing 1-10 of 27 references
Audio Event Detection using Weakly Labeled Data
TLDR
It is shown that audio event detection using weak labels can be formulated as a multiple-instance learning (MIL) problem, and two frameworks for solving it are suggested: one based on support vector machines and the other on neural networks.
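In the multiple-instance view of weak labels, a recording (bag) is positive if at least one of its segments (instances) contains the event, which suggests max pooling over instance scores. A minimal NumPy sketch of the neural-network variant's bag-level loss, with assumed shapes and not the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_probability(instance_logits):
    """MIL aggregation: a bag is positive if at least one instance
    is positive, so pool instance logits with max before squashing."""
    return sigmoid(np.max(instance_logits))

def bag_loss(instance_logits, bag_label):
    """Binary cross-entropy computed on the bag-level probability;
    only the clip-level (weak) label is needed for training."""
    p = bag_probability(instance_logits)
    eps = 1e-12                            # guard against log(0)
    return -(bag_label * np.log(p + eps)
             + (1 - bag_label) * np.log(1 - p + eps))
```

Gradients from this loss flow mainly through the highest-scoring segment, which is what lets segment-level detectors emerge from clip-level labels.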
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
TLDR
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
Robust Audio Event Recognition with 1-Max Pooling Convolutional Neural Networks
TLDR
The proposed CNN not only shows state-of-the-art performance on the standard task of robust audio event recognition but also outperforms other deep architectures by up to 4.5% in terms of recognition accuracy, equivalent to a 76.3% relative error reduction.
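1-max pooling keeps only the single strongest activation of each filter over time, so the detector fires wherever in the clip the event occurs. A toy NumPy illustration with a hypothetical signal and filter:

```python
import numpy as np

def conv1d_valid(x, w):
    """Plain 1-D 'valid' convolution (cross-correlation) of x with w."""
    k = len(w)
    return np.array([x[t:t + k] @ w for t in range(len(x) - k + 1)])

def one_max_pool(feature_map):
    """1-max pooling: reduce a filter's whole activation sequence to
    its single strongest response, giving shift invariance over time."""
    return np.max(feature_map)

# Each filter contributes exactly one number regardless of clip length:
x = np.array([0.0, 0.1, 3.0, 0.2, 0.0])   # toy 1-D signal with one "event"
w = np.array([1.0, 1.0])                   # toy filter
```

Because the output size no longer depends on the input length, clips of varying duration map to fixed-size feature vectors for the classifier.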
Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching
TLDR
It is important to combine strong complementary features from multiple modalities for multimedia event detection, and cross-frame matching is helpful in coping with temporal order variation.
E-LAMP: integration of innovative ideas for multimedia event detection
TLDR
The core methods and technologies of the framework developed recently for Event Labeling through Analytic Media Processing (E-LAMP) system are introduced and a novel algorithm is developed to learn a more robust and discriminative intermediate feature representation from multiple features so that better event models can be built upon it.
Environmental sound classification with convolutional neural networks
Karol J. Piczak · 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), 2015
TLDR
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Dropout: a simple way to prevent neural networks from overfitting
TLDR
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
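Inverted dropout, the form commonly used in practice, can be sketched as follows; this is a generic illustration of the technique, not tied to the paper's experiments:

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: during training, zero each unit with
    probability p and rescale survivors by 1/(1-p) so the expected
    activation is unchanged; at test time the layer is the identity."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p        # keep with probability 1-p
    return x * mask / (1.0 - p)
```

The rescaling is what lets the same weights be used unmodified at inference, since the expected value of each activation matches between train and test.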