Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection

Shao-Yen Tseng, Juncheng Billy Li, Yun Wang, Florian Metze, Joseph Szurley, Samarjit Das
State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine resolution annotations are too expensive to obtain. In this paper, we propose a small-footprint multiple instance learning (MIL) framework for multi-class AED using weakly annotated labels. The proposed MIL framework uses audio embeddings extracted from a pre-trained convolutional neural network as… 
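As a rough illustration of the embedding-level MIL setup the abstract describes (the shapes and names below are illustrative assumptions, not details from the paper), a clip can be treated as a bag of frame-level instances whose per-frame scores are pooled, e.g. by attention, into a single clip-level prediction that matches the weak label:

```python
import numpy as np

def attention_pool(instance_scores, attention_logits):
    """Pool per-frame (instance) scores into a clip-level (bag) score.

    instance_scores:  (T, C) per-frame class probabilities in [0, 1]
    attention_logits: (T, C) unnormalised per-frame attention weights
    Returns: (C,) clip-level probabilities.
    """
    # Softmax over the time axis gives each frame a weight per class.
    w = np.exp(attention_logits - attention_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    # Weighted average of instance scores; weights sum to 1 per class,
    # so the clip-level output stays in [0, 1].
    return (w * instance_scores).sum(axis=0)

# Weak supervision: only the clip label is known; frames are unlabeled.
rng = np.random.default_rng(0)
scores = rng.uniform(size=(10, 3))   # 10 frames, 3 event classes
logits = rng.normal(size=(10, 3))
clip_probs = attention_pool(scores, logits)
assert clip_probs.shape == (3,)
```

Training then compares `clip_probs` against the clip-level weak label, so no frame-level annotations are needed.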


Improving weakly supervised sound event detection with self-supervised auxiliary tasks
This paper proposes a shared encoder architecture with sound event detection as the primary task and a secondary decoder for a self-supervised auxiliary task, together with a two-step attention pooling mechanism that provides time-frequency localisation of multiple audio events in the clip.
A Global-Local Attention Framework for Weakly Labelled Audio Tagging
  • Helin Wang, Yuexian Zou, Wenwu Wang. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
A novel two-stream framework for audio tagging that exploits the global and local information of sound events and can significantly improve audio tagging performance under different baseline network architectures.
Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention
A novel attention mechanism, namely spatial and channel-wise attention (SCA), is proposed; it can be employed in any CNN seamlessly with affordable overhead and is end-to-end trainable.
Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices
This work uses multiple instance learning (MIL) to enable model development with annotations available at a lower time resolution (coarsely labeled), applies MIL to localize foreground speech in coarsely labeled audio, and reports both bag-level and instance-level results.
Deep progressive multi-scale attention for acoustic event classification
The proposed MSA model encodes multi-scale features with local and global discriminative structure and effectively improves on current state-of-the-art deep learning algorithms.
Cross-scale Attention Model for Acoustic Event Classification
A cross-scale attention (CSA) model is proposed, which explicitly integrates features from different scales to form the final representation and can effectively improve the performance of current state-of-the-art deep learning algorithms.
Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging
A novel pooling algorithm for MIL, named gated multi-head attention pooling (GMAP), is proposed; it attends to information about events from different heads at different positions, increasing the modeling power of single-head attention with no computational overhead.
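A minimal sketch of multi-head attention pooling in the spirit of GMAP (the gating scheme, head count, and weight shapes here are illustrative assumptions, not the authors' exact formulation): each head forms its own attention over time, and learned per-head gates weight the heads' pooled scores.

```python
import numpy as np

def multi_head_attention_pool(x, w_att, w_cls, gate):
    """Pool a sequence of frame embeddings with several attention heads.

    x:     (T, D) frame embeddings
    w_att: (H, D) one attention projection per head
    w_cls: (D,)   shared classifier weights for the pooled embedding
    gate:  (H,)   learned per-head gates in [0, 1]
    Returns a single clip-level score combining all heads.
    """
    head_scores = []
    for h in range(w_att.shape[0]):
        logits = x @ w_att[h]                       # (T,) per-frame logits
        a = np.exp(logits - logits.max())
        a /= a.sum()                                # softmax over time
        pooled = (a[:, None] * x).sum(axis=0)       # (D,) weighted embedding
        head_scores.append(pooled @ w_cls)
    head_scores = np.array(head_scores)             # (H,)
    # Gates weight each head's contribution before averaging.
    return (gate * head_scores).sum() / gate.sum()

rng = np.random.default_rng(1)
score = multi_head_attention_pool(
    rng.normal(size=(8, 4)), rng.normal(size=(3, 4)),
    rng.normal(size=4), np.array([0.9, 0.5, 0.1]))
assert np.isfinite(score)
```

With all gates equal and identical heads, this reduces to ordinary single-head attention pooling; the gates matter when heads specialize to different positions or event types.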
A Two-student Learning Framework for Mixed Supervised Target Sound Detection
A novel two-student learning framework is proposed, containing two mutually helping student models that learn from fully and weakly annotated datasets, respectively; the framework learns novel categories using weak annotations with the help of full annotations of existing base categories (the source domain).
Polyphonic Sound Event Detection with Weak Labeling
This thesis proposes to train deep learning models for SED using various levels of weak labeling, and shows that the sound events can be learned and localized by a recurrent neural network (RNN) with a connectionist temporal classification (CTC) output layer, which is well suited for sequential supervision.
Specialized Decision Surface and Disentangled Feature for Weakly-Supervised Polyphonic Sound Event Detection
Experiments show that the proposed SDS and DF significantly improve the detection performance of the embedding-level MIL approach with an attention pooling module and outperform the first-place system in the challenge by 6.6 percentage points.


Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won first place in the large-scale weakly supervised sound event detection task of DCASE 2017.
Audio Event Detection using Weakly Labeled Data
It is shown that audio event detection using weak labels can be formulated as a multiple instance learning (MIL) problem, and two frameworks for solving it are suggested, one based on support vector machines and the other on neural networks.
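The MIL formulation mentioned here can be sketched in a few lines: a recording is a bag of segments, and under the standard MIL assumption the bag is positive if and only if at least one segment contains the event (a plain max-pooling reading of MIL, not either paper's exact model):

```python
import numpy as np

def bag_probability(segment_probs):
    """MIL bag score via max pooling: a recording (bag) contains an
    event iff at least one of its segments (instances) does."""
    return float(np.max(segment_probs))

# Weakly labeled clip: segment-level probabilities from some detector.
probs = np.array([0.05, 0.10, 0.92, 0.07])
assert bag_probability(probs) == 0.92  # the clip is flagged positive
```

The same max pooling also gives free localization: the arg-max segment is the model's guess for where the event occurs, even though only clip-level labels were used in training.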
In this paper, we describe our contribution to the challenge of detection and classification of acoustic scenes and events (DCASE2017). We propose framCNN, a novel weakly-supervised learning approach.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels. Varying the size of both the training set and the label vocabulary shows that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
This work combines convolutional and recurrent approaches in a convolutional recurrent neural network (CRNN), applies it to a polyphonic sound event detection task, and observes considerable improvement on four different datasets of everyday sound events.
YouTube-8M: A Large-Scale Video Classification Benchmark
YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities, and various (modest) classification models are trained on the dataset.
Very Deep Convolutional Networks for Large-Scale Image Recognition
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Non-speech audio event detection
This paper describes experiments with SVM and HMM-based classifiers, using a 290-hour corpus of sound effects, and reports promising results, despite the difficulties posed by the mixtures of audio events that characterize real sounds.
Sound event detection using non-negative dictionaries learned from annotated overlapping events
  • O. Dikmen, A. Mesaros. 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013.
This paper proposes a method that bypasses the need to build separate sound models by learning non-negative dictionaries for the sound content and its annotations in a coupled manner; very promising results are obtained using only a small amount of data.