Audio Set Classification with Attention Model: A Probabilistic Perspective

@article{Kong2018AudioSC,
  title={Audio Set Classification with Attention Model: A Probabilistic Perspective},
  author={Qiuqiang Kong and Yong Xu and Wenwu Wang and Mark D. Plumbley},
  journal={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  pages={316-320}
}
This paper investigates Audio Set classification. Following the terminology of multiple instance learning, each feature is called an instance and the collection of features a bag. In the attention model, each instance in the bag has a trainable probability measure for each class, and the classification of the bag is the expectation of the instance-level classification outputs with respect to the learned probability measure. Experiments show that the proposed attention model achieves a mAP of…
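In other words, for each class k the bag-level output is y_k = sum_n p_k(x_n) * f_k(x_n), where f_k(x_n) is the instance-level classification output and the attention weights p_k(x_n) are non-negative and sum to one over the instances in the bag. The following is a minimal PyTorch sketch of this decision-level attention pooling; the sigmoid instance classifier, the clamped-exponential attention, and all module names are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Decision-level attention pooling over a bag of instances.

    Each instance x_n receives a class-wise weight p_k(x_n), normalized
    over the bag, and a class-wise output f_k(x_n); the bag prediction
    is the expectation sum_n p_k(x_n) * f_k(x_n).
    """
    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.cla = nn.Linear(feature_dim, num_classes)  # instance classifier f
        self.att = nn.Linear(feature_dim, num_classes)  # unnormalized attention v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_instances, feature_dim), e.g. 10 embeddings per clip
        f = torch.sigmoid(self.cla(x))                    # instance outputs in (0, 1)
        v = torch.exp(torch.clamp(self.att(x), -10, 10))  # positive attention scores
        p = v / v.sum(dim=1, keepdim=True)                # probability measure over the bag
        return (p * f).sum(dim=1)                         # expectation = bag prediction

# Toy usage: bags of 10 instances, 128-d features, 527 AudioSet classes.
model = AttentionPooling(feature_dim=128, num_classes=527)
y = model(torch.randn(4, 10, 128))  # shape (4, 527), values in (0, 1)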

Citations

Multi-level Attention Model for Weakly Supervised Audio Classification
TLDR
A multi-level attention model consisting of multiple attention modules applied to intermediate neural network layers achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single attention model and the Google baseline system.
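The multi-level idea can be sketched by applying the attention pooling above to several intermediate embedding layers and combining the per-level bag predictions. In the sketch below, the layer sizes, the concatenation of level outputs, and the final linear combiner are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn

class MultiLevelAttention(nn.Module):
    """Attention pooling applied at several intermediate layers; the
    per-level bag predictions are concatenated and mapped to the
    final class probabilities."""
    def __init__(self, in_dim=128, hidden=512, num_classes=527, levels=2):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.atts = nn.ModuleList()
        self.clas = nn.ModuleList()
        dim = in_dim
        for _ in range(levels):
            self.blocks.append(nn.Sequential(nn.Linear(dim, hidden), nn.ReLU()))
            self.atts.append(nn.Linear(hidden, num_classes))
            self.clas.append(nn.Linear(hidden, num_classes))
            dim = hidden
        self.out = nn.Linear(levels * num_classes, num_classes)

    def forward(self, x):
        # x: (batch, num_instances, in_dim)
        pooled, h = [], x
        for block, att, cla in zip(self.blocks, self.atts, self.clas):
            h = block(h)                                  # intermediate embedding
            f = torch.sigmoid(cla(h))                     # instance predictions
            v = torch.exp(torch.clamp(att(h), -10, 10))   # positive attention scores
            p = v / v.sum(dim=1, keepdim=True)            # normalize over instances
            pooled.append((p * f).sum(dim=1))             # per-level bag prediction
        return torch.sigmoid(self.out(torch.cat(pooled, dim=-1)))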
Weakly Labelled AudioSet Tagging With Attention Neural Networks
TLDR
This work bridges the connection between attention neural networks and multiple instance learning (MIL) methods, and proposes decision-level and feature-level attention neural networks for audio tagging that achieve a state-of-the-art mean average precision.
Segment Relevance Estimation for Audio Analysis and Weakly-Labelled Classification
TLDR
A neural network architecture, namely RELNET, is proposed that leverages the relevance measure for weakly-labelled audio classification problems and achieves competitive classification results compared to previous attention-based proposals.
PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation
TLDR
PSLA is presented, a collection of training techniques (ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation) whose design choices noticeably boost model accuracy and achieve a new state-of-the-art mean average precision on AudioSet.
A Deep Residual Network for Large-Scale Acoustic Scene Analysis
TLDR
The task of training a multi-label event classifier directly from the audio recordings of AudioSet is studied, and it is found that the models are able to localize audio events when a finer time resolution is needed.
Audio event detection on Google's Audio Set database: Preliminary results using different types of DNNs
TLDR
Of all the classifiers tested, the LSTM neural network showed the best results, with a mean average precision and a mean recall of 0.30698; this is particularly relevant since the embeddings provided by Google, used as input to the DNNs, are sequences of at most 10 feature vectors and therefore limit the sequence modelling capabilities of LSTMs.
Self-supervised Attention Model for Weakly Labeled Audio Event Classification
TLDR
A novel weakly labeled Audio Event Classification approach based on a self-supervised attention model that achieves 8.8% and 17.6% relative mean average precision improvements over the current state-of-the-art systems for SL-DCASE-17 and the balanced AudioSet.
Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging
TLDR
A novel pooling algorithm for MIL, named gated multi-head attention pooling (GMAP), is proposed; it is able to attend to event information from different heads at different positions, and it increases the modeling power of single-head attention with no computational overhead.
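A hedged sketch of the multi-head variant follows: each head runs its own attention pooling, and a learned per-head, per-class gate (softmax-normalized across heads) decides how much each head contributes. The gating parameterization here is an assumption for illustration, not necessarily the GMAP paper's exact formulation.

import torch
import torch.nn as nn

class GatedMultiHeadAttentionPooling(nn.Module):
    """Multi-head attention pooling with a learned gate per head."""
    def __init__(self, feature_dim=128, num_classes=527, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.ModuleDict({
                "att": nn.Linear(feature_dim, num_classes),
                "cla": nn.Linear(feature_dim, num_classes),
            })
            for _ in range(num_heads)
        )
        # One gate logit per (head, class), softmax-normalized across heads.
        self.gate = nn.Parameter(torch.zeros(num_heads, num_classes))

    def forward(self, x):
        # x: (batch, num_instances, feature_dim)
        outs = []
        for head in self.heads:
            f = torch.sigmoid(head["cla"](x))                    # instance predictions
            v = torch.exp(torch.clamp(head["att"](x), -10, 10))  # attention scores
            p = v / v.sum(dim=1, keepdim=True)                   # normalize over the bag
            outs.append((p * f).sum(dim=1))                      # per-head bag prediction
        y = torch.stack(outs, dim=1)                             # (batch, heads, classes)
        g = torch.softmax(self.gate, dim=0)                      # gate weights across heads
        return (g.unsqueeze(0) * y).sum(dim=1)                   # gated combination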
Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study
TLDR
A dual-branch neural network architecture is developed for the joint learning of voice and acoustic features during an AED process, and thorough empirical studies are conducted to examine performance on the public AudioSet with different types of inputs.
Learning Multi-instrument Classification with Partial Labels
TLDR
This work investigates the use of attention-based recurrent neural networks to address the weakly-labeled problem of multi-instrument recognition and uses different data augmentation methods to mitigate the partial-label problem.
...

References

SHOWING 1-10 OF 19 REFERENCES
A joint detection-classification model for audio tagging of weakly labelled data
TLDR
This work proposes a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously and shows that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%.
Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
TLDR
A weakly supervised method is proposed that not only predicts the tags but also indicates the temporal locations of the acoustic events that occur; the attention scheme is found to be effective in identifying the important frames while ignoring the unrelated frames.
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
TLDR
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning is proposed to handle the multilabel classification task, with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) generating new data-driven features from the logarithmic Mel-filter bank features.
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the DCASE 2017 challenge.
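The gating mechanism named here is the gated linear unit: one convolution produces features and a parallel convolution produces sigmoid gates that control how much of each feature passes through. A minimal sketch, assuming 2-D convolutions over a log-mel spectrogram and illustrative kernel sizes:

import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Convolutional block with a gated linear unit (GLU)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.linear = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, channels, time, mel_bins)
        return self.linear(x) * torch.sigmoid(self.gate(x))  # GLU gating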
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task and that larger training and label sets help up to a point.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set, a large-scale dataset of manually annotated audio events, is described; it endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
Audio Event Detection using Weakly Labeled Data
TLDR
It is shown that audio event detection using weak labels can be formulated as a Multiple Instance Learning (MIL) problem, and two frameworks for solving it are suggested, one based on support vector machines and the other on neural networks.
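Under the standard multiple-instance assumption, a bag is positive if and only if at least one of its instances is positive, which in a neural network reduces to max pooling over instance predictions. A minimal sketch with the same shapes as the attention-pooling example above (the sigmoid classifier is again an illustrative assumption):

import torch
import torch.nn as nn

class MaxPoolingMIL(nn.Module):
    """Weak-label MIL under the standard assumption: the bag
    prediction is the most confident instance prediction."""
    def __init__(self, feature_dim=128, num_classes=527):
        super().__init__()
        self.cla = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        # x: (batch, num_instances, feature_dim)
        f = torch.sigmoid(self.cla(x))   # instance-level predictions
        return f.max(dim=1).values       # max over instances = bag prediction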
Convolutional gated recurrent neural network incorporating spatial features for audio tagging
TLDR
This paper proposes to use a convolutional neural network (CNN) to extract robust features from mel-filter banks, spectrograms, or even raw waveforms for audio tagging, and evaluates the proposed methods on Task 4 of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge.
A review of multi-instance learning assumptions
TLDR
This paper aims to clarify the use of alternative MI assumptions by reviewing the work done in this area, and focuses on a relaxed view of the MI problem, where the standard MI assumption is dropped and alternative assumptions are considered instead.
Multiple instance classification: Review, taxonomy and comparative study
...