Corpus ID: 221140130

Multi-Task Learning for Interpretable Weakly Labelled Sound Event Detection

Soham Deshmukh, Bhiksha Raj, Rita Singh
Weakly labelled learning has garnered a lot of attention in recent years due to its potential to scale Sound Event Detection (SED), and it is formulated as a Multiple Instance Learning (MIL) problem. This paper proposes a Multi-Task Learning (MTL) framework for learning from weakly labelled audio data which encompasses the traditional MIL setup. To show the utility of the proposed framework, we use reconstruction of the input Time-Frequency (T-F) representation as the auxiliary task. We show that the chosen…
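As a rough illustration of the setup the abstract describes, the sketch below combines a MIL-style clip-level classification loss with a T-F reconstruction loss. It is not the authors' implementation: the pooling choices, the mean-squared-error reconstruction term, the weight `alpha`, and all function names are assumptions made for illustration only.

```python
import math


def clip_probability(frame_probs, pooling="max"):
    """Pool frame-level probabilities for one class into a clip-level probability.

    In the MIL view, a clip is a bag of frames and only the bag label is known.
    """
    if pooling == "max":
        return max(frame_probs)
    if pooling == "mean":
        return sum(frame_probs) / len(frame_probs)
    raise ValueError(f"unknown pooling: {pooling}")


def binary_cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one weak label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mse(a, b):
    """Mean squared error between two equally shaped 2-D lists (T-F matrices)."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def mtl_loss(frame_probs_per_class, weak_labels, tf_input, tf_recon,
             alpha=0.1, pooling="max"):
    """Primary SED loss plus alpha times the auxiliary reconstruction loss.

    alpha is a hypothetical weighting hyperparameter, not a value from the paper.
    Returns (total, sed_loss, reconstruction_loss).
    """
    sed = sum(
        binary_cross_entropy(clip_probability(fp, pooling), y)
        for fp, y in zip(frame_probs_per_class, weak_labels)
    ) / len(weak_labels)
    rec = mse(tf_input, tf_recon)
    return sed + alpha * rec, sed, rec


# Toy clip: two classes, three frames, and a 2x2 "spectrogram".
frame_probs = [[0.1, 0.9, 0.2], [0.05, 0.1, 0.02]]
weak_labels = [1, 0]
tf_input = [[0.0, 1.0], [1.0, 0.0]]
tf_recon = [[0.0, 0.0], [1.0, 0.0]]
total, sed, rec = mtl_loss(frame_probs, weak_labels, tf_input, tf_recon)
```

Setting `alpha=0` in this sketch recovers a plain MIL objective, which is one way to read the claim that the framework encompasses the traditional MIL setup.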


Attentive Max Feature Map for Acoustic Scene Classification with Joint Learning considering the Abstraction of Classes
This work proposes a mechanism referred to as the attentive max feature map, which combines two effective techniques, attention and max feature map, to further elaborate the attention mechanism and mitigate the above-mentioned phenomenon.
Interpreting Glottal Flow Dynamics for Detecting Covid-19 From Voice
A method that analyzes the differential dynamics of the glottal flow waveform during voice production to identify features in them that are most significant for the detection of COVID-19 from voice to infer their potential as discriminative features for classification.
A Multi-Modal Respiratory Disease Exacerbation Prediction Technique Based on a Spatio-Temporal Machine Learning Architecture
A multi-modal solution for predicting the exacerbation risks of respiratory diseases, such as COPD, based on a novel spatio-temporal machine learning architecture for real-time and accurate respiratory events detection, and tracking of local environmental and meteorological data and trends is presented.
MusicNet: Compact Convolutional Neural Network for Real-time Background Music Detection
MusicNet is a compact neural model for detecting background music in the real-time communications pipeline and has a true positive rate (TPR) of 81.3% at a 0.1% false positive rate, which is better than state-of-the-art models included in the study.


Multiple Instance Deep Learning for Weakly Supervised Audio Event Detection
It is shown that audio embeddings extracted by the convolutional neural networks significantly boost the performance of all MIL models, and this framework reduces the model complexity of the AED system and is suitable for applications where computational resources are limited.
Adaptive Pooling Operators for Weakly Labeled Sound Event Detection
This paper treats SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality, and develops a family of adaptive pooling operators (referred to as autopool) which smoothly interpolate between common pooling operators and automatically adapt to the characteristics of the sound sources in question.
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly
Audio Event Detection using Weakly Labeled Data
It is shown that audio event detection using weak labels can be formulated as a Multiple Instance Learning problem, and two frameworks for solving multiple-instance learning are suggested, one based on support vector machines and the other on neural networks.
A Deep Residual Network for Large-Scale Acoustic Scene Analysis
The task of training a multi-label event classifier directly from the audio recordings of AudioSet is studied and it is found that the models are able to localize audio events when a finer time resolution is needed.
Weakly-supervised audio event detection using event-specific Gaussian filters and fully convolutional networks
A model based on convolutional neural networks that relies only on weakly-supervised data for training and is able to detect frame-level information, e.g., the temporal position of sounds, even when it is trained merely with clip-level labels.
Multi-Task Self-Supervised Learning for Robust Speech Recognition
PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
Audio Set Classification with Attention Model: A Probabilistic Perspective
This paper investigates the Audio Set classification. Audio Set is a large scale weakly labelled dataset (WLD) of audio clips. In WLD only the presence of a label is known, without knowing the
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
Sound Event Detection and Time–Frequency Segmentation from Weakly Labelled Data
A time–frequency (T–F) segmentation framework trained on weakly labelled data is proposed to tackle the sound event detection and separation problem; predicted onset and offset times can be obtained from the T–F segmentation masks.