Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network

@inproceedings{Xu2018LargeScaleWS,
  title={Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network},
  author={Yong Xu and Qiuqiang Kong and Wenwu Wang and Mark D. Plumbley},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  pages={121-125}
}
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won first place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, extracted from YouTube videos, are manually labelled with one or more audio tags but without time stamps of the audio events, hence referred to…
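
The gating described here applies a gated linear unit (GLU) to convolutional feature maps, letting a learned sigmoid gate select informative time-frequency units. A minimal PyTorch sketch of such a block, with illustrative layer sizes rather than the paper's exact configuration:

import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Convolutional block with a gated linear unit (GLU):
    output = conv(x) * sigmoid(gate(x)).
    Sizes are illustrative, not the paper's exact setup."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # The sigmoid gate decides, per time-frequency unit, how much
        # of the linear response is passed on.
        return self.conv(x) * torch.sigmoid(self.gate(x))

x = torch.randn(4, 1, 240, 64)           # (batch, 1, frames, mel bins)
print(GatedConvBlock(1, 32)(x).shape)    # torch.Size([4, 32, 240, 64])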

Citations

Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection
TLDR
This paper proposes a small-footprint multiple instance learning (MIL) framework for multi-class AED using weakly annotated labels, and shows that audio embeddings extracted by convolutional neural networks significantly boost the performance of all MIL models.
Weakly Labelled AudioSet Tagging With Attention Neural Networks
TLDR
This work connects attention neural networks with multiple instance learning (MIL) methods and proposes decision-level and feature-level attention neural networks for audio tagging, achieving a state-of-the-art mean average precision.
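
As a rough illustration of decision-level attention pooling for weakly labelled tagging (the shapes, names, and single-layer classifier are assumptions, not the authors' code):

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Decision-level attention pooling: clip probability is
    sum_t a(t) * p(t), with attention a(t) normalised over time."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)   # frame-level classifier
        self.att = nn.Linear(feat_dim, n_classes)   # frame-level attention

    def forward(self, h):                  # h: (batch, time, feat_dim)
        p = torch.sigmoid(self.cla(h))     # frame-level probabilities
        a = torch.softmax(self.att(h), 1)  # attention normalised over time
        return (a * p).sum(dim=1)          # clip-level probabilities

h = torch.randn(4, 240, 128)
print(AttentionPooling(128, 17)(h).shape)  # torch.Size([4, 17])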
A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling
TLDR
A network architecture designed mainly for audio tagging that can also perform weakly supervised acoustic event detection (AED); it consists of a modified DenseNet as the feature extractor and a global average pooling (GAP) layer to predict frame-level labels at inference time.
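
A minimal sketch of the GAP idea, with a placeholder backbone standing in for the modified DenseNet: a 1x1 convolution yields per-frame class logits, GAP trains against clip labels, and the per-frame map is read out at inference:

import torch
import torch.nn as nn

class GapTagger(nn.Module):
    """Trained with clip labels only: a 1x1 conv produces per-frame
    class logits; global average pooling collapses them to clip logits.
    At inference the per-frame map serves as a rough SED output."""
    def __init__(self, feat_ch, n_classes):
        super().__init__()
        self.backbone = nn.Conv2d(1, feat_ch, 3, padding=1)  # placeholder
        self.head = nn.Conv2d(feat_ch, n_classes, kernel_size=1)

    def forward(self, x):                       # x: (batch, 1, time, mel)
        f = torch.relu(self.backbone(x))
        frame_logits = self.head(f).mean(dim=3)  # pool mel -> (b, C, time)
        clip_logits = frame_logits.mean(dim=2)   # GAP over time -> (b, C)
        return clip_logits, frame_logits

clip, frames = GapTagger(32, 17)(torch.randn(2, 1, 240, 64))
print(clip.shape, frames.shape)  # (2, 17) (2, 17, 240)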
Multi-level Attention Model for Weakly Supervised Audio Classification
TLDR
A multi-level attention model consisting of multiple attention modules applied to intermediate neural network layers; it achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single-attention model and the Google baseline system.
Audio Tagging With Connectionist Temporal Classification Model Using Sequential Labelled Data
TLDR
A convolutional recurrent neural network trained with a connectionist temporal classification objective (CRNN-CTC) to map an audio clip's spectrogram to sequentially labelled data (SLD), in which both the presence or absence of sound events and their order are known.
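
A minimal sketch of training against sequential event labels with PyTorch's nn.CTCLoss; a single GRU stands in for the CRNN encoder, and the sizes and two-event targets are made-up assumptions:

import torch
import torch.nn as nn

T, B, C = 240, 2, 17 + 1           # frames, batch, classes (+1 CTC blank)
encoder = nn.GRU(64, C, batch_first=False)
log_probs = torch.log_softmax(encoder(torch.randn(T, B, 64))[0], dim=2)

# Targets encode only the *order* of events per clip, e.g. [dog, horn].
targets = torch.tensor([[3, 7], [5, 1]])
input_lens = torch.full((B,), T, dtype=torch.long)
target_lens = torch.full((B,), 2, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
print(ctc(log_probs, targets, input_lens, target_lens).item())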
A Comparison of Attention Mechanisms of Convolutional Neural Network in Weakly Labeled Audio Tagging
TLDR
The results show that attention based on the GLU outperforms attention based on the SE block in a CRNN for weakly labelled polyphonic audio tagging.
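
For reference, a squeeze-and-excitation (SE) block, the alternative gating style compared against the GLU; a minimal sketch with illustrative sizes:

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global-average-pool each channel, pass
    the result through a small bottleneck MLP, and rescale channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                 # x: (batch, ch, time, mel)
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze -> channel weights
        return x * w[:, :, None, None]    # excite: rescale channels

x = torch.randn(2, 32, 240, 64)
print(SEBlock(32)(x).shape)  # torch.Size([2, 32, 240, 64])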
Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection
TLDR
This work addresses sound event detection when only a weakly annotated dataset is available for training; it explores an approach inspired by multiple instance learning, in which a convolutional recurrent neural network is trained to give frame-level predictions using a custom loss function based on the weak labels and the statistics of the frame-based predictions.
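
One plausible reading of such a loss, sketched under assumptions (max pooling as the frame statistic, and a penalty on pairwise cosine similarity between hypothetical per-class embeddings); this is not the paper's exact formulation:

import torch
import torch.nn.functional as F

def weak_mil_loss(frame_probs, weak_labels, class_emb, penalty=0.1):
    """BCE between clip labels and a statistic of the frame predictions
    (max here), plus a penalty on pairwise cosine similarity between
    class embeddings so classes stay discriminable."""
    clip_probs = frame_probs.max(dim=1).values          # (batch, classes)
    bce = F.binary_cross_entropy(clip_probs, weak_labels)
    e = F.normalize(class_emb, dim=1)                   # (classes, dim)
    sim = e @ e.t()                                     # pairwise cosines
    off_diag = sim - torch.diag(torch.diag(sim))        # zero the diagonal
    return bce + penalty * off_diag.abs().mean()

frame_probs = torch.rand(4, 240, 10)                    # frame predictions
weak_labels = torch.randint(0, 2, (4, 10)).float()
class_emb = torch.randn(10, 64, requires_grad=True)
print(weak_mil_loss(frame_probs, weak_labels, class_emb))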
Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization
TLDR
A convolutional neural network transformer (CNN-Transformer) is proposed for audio tagging and SED, and it is shown that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN).
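
Automatic threshold optimization can be sketched as a per-class search that maximises F1 on held-out predictions; a simple grid search stands in for the paper's procedure (the grid and metric choice are assumptions):

import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(val_probs, val_labels,
                        grid=np.linspace(0.05, 0.95, 19)):
    """For each class, pick the decision threshold maximising F1
    on a validation set."""
    thresholds = np.full(val_probs.shape[1], 0.5)
    for c in range(val_probs.shape[1]):
        scores = [f1_score(val_labels[:, c], val_probs[:, c] >= t,
                           zero_division=0) for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds

rng = np.random.default_rng(0)
probs = rng.random((200, 5))
labels = (rng.random((200, 5)) > 0.7).astype(int)
print(optimize_thresholds(probs, labels))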
A Region Based Attention Method for Weakly Supervised Sound Event Detection and Classification
TLDR
A novel region-based attention method is proposed to further boost the representational power of the existing GLU-based CRNN: region features are extracted from multi-scale sliding windows over higher convolutional layers and fed into an attention-based recurrent neural network.
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
TLDR
This work studies several neural network-based systems for speech and music event detection on a collection of 77,937 10-second audio segments selected from the Google AudioSet dataset.
…

References

Showing 1-10 of 23 references
Convolutional gated recurrent neural network incorporating spatial features for audio tagging
TLDR
This paper proposes using a convolutional neural network (CNN) to extract robust features from mel-filter banks, spectrograms, or even raw waveforms for audio tagging, and evaluates the proposed methods on Task 4 of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge.
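
A typical log-mel front end of the kind referred to, sketched with librosa; the parameter values are common defaults, not necessarily the cited paper's settings:

import librosa

# Load librosa's bundled example clip and compute a log-mel spectrogram.
y, sr = librosa.load(librosa.ex("trumpet"), sr=32000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=320, n_mels=64)
log_mel = librosa.power_to_db(mel)   # (64 mel bins, time frames)
print(log_mel.shape)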
Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
TLDR
A weakly supervised method that not only predicts the tags but also indicates the temporal locations of the acoustic events; the attention scheme is found to be effective at identifying the important frames while ignoring unrelated ones.
FRAMECNN : A WEAKLY-SUPERVISED LEARNING FRAMEWORK FOR FRAME-WISE ACOUSTIC EVENT DETECTION AND CLASSIFICATION
In this paper, we describe our contribution to the challenge of detection and classification of acoustic scenes and events (DCASE 2017). We propose frameCNN, a novel weakly-supervised learning…
ENSEMBLE OF CONVOLUTIONAL NEURAL NETWORKS FOR WEAKLY-SUPERVISED SOUND EVENT DETECTION USING MULTIPLE SCALE INPUT
TLDR
The proposed model, an ensemble of convolutional neural networks for detecting audio events in the automotive environment, achieved 2nd place in audio tagging and 1st place in sound event detection.
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
TLDR
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning handles the multilabel classification task, with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) generating new data-driven features from logarithmic mel-filter bank features.
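
A minimal sketch of a symmetric denoising auto-encoder used as an unsupervised feature learner; the layer sizes and Gaussian corruption are assumptions:

import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Corrupt the input with noise, reconstruct the clean input,
    and use the bottleneck as a learned feature."""
    def __init__(self, in_dim=64, hid=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.dec = nn.Linear(hid, in_dim)

    def forward(self, x, noise_std=0.1):
        z = self.enc(x + noise_std * torch.randn_like(x))  # corrupt, encode
        return self.dec(z), z      # reconstruction and bottleneck feature

x = torch.randn(8, 64)                     # a batch of log-mel frames
recon, feat = DenoisingAE()(x)
loss = nn.functional.mse_loss(recon, x)    # train to undo the corruption
print(feat.shape, loss.item())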
Audio Event Detection using Weakly Labeled Data
TLDR
It is shown that audio event detection using weak labels can be formulated as a multiple instance learning (MIL) problem, and two frameworks for solving it are suggested, one based on support vector machines and the other on neural networks.
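
The MIL formulation can be stated compactly: a clip (bag) is positive for a class if at least one frame (instance) is, which max pooling over frame probabilities implements directly. A sketch, not the cited system:

import torch
import torch.nn.functional as F

frame_probs = torch.rand(4, 240, 17)        # (clips, frames, classes)
bag_probs = frame_probs.max(dim=1).values   # bag = max over instances
weak_labels = torch.randint(0, 2, (4, 17)).float()
print(F.binary_cross_entropy(bag_probs, weak_labels).item())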
A joint detection-classification model for audio tagging of weakly labelled data
TLDR
This work proposes a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously and shows that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%.
Sound event detection using spatial features and convolutional recurrent neural network
TLDR
This paper proposes using low-level spatial features extracted from multichannel audio for sound event detection, and shows that the network learns sound events in multichannel audio better when the features of each channel are presented as separate layers of a volume rather than concatenated into a single feature vector.
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
TLDR
This work combines these two approaches in a convolutional recurrent neural network (CRNN), applies it to a polyphonic sound event detection task, and observes a considerable improvement on four different datasets of everyday sound events.
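
A minimal CRNN sketch showing the combination: convolutional layers learn local time-frequency patterns, a bidirectional GRU adds temporal context, and a sigmoid layer emits per-frame activities (sizes are illustrative, not the cited configuration):

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)))             # pool frequency, keep time
        self.gru = nn.GRU(32 * (n_mels // 4), 64,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                     # x: (batch, 1, time, mel)
        f = self.conv(x)                      # (batch, 32, time, mel/4)
        f = f.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 32*mel/4)
        h, _ = self.gru(f)
        return torch.sigmoid(self.out(h))     # frame-level activities

x = torch.randn(2, 1, 240, 64)
print(CRNN()(x).shape)  # torch.Size([2, 240, 6])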
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
This paper describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
…