On Learning Disentangled Representation for Acoustic Event Detection

  • L. Gao, Qirong Mao, M. Dong, Yu Jing, Ratna Babu Chinnam
  • Published 15 October 2019
  • Computer Science
  • Proceedings of the 27th ACM International Conference on Multimedia
Polyphonic Acoustic Event Detection (AED) is a challenging task as the sounds are mixed with the signals from different events, and the features extracted from the mixture do not match well with features calculated from sounds in isolation, leading to suboptimal AED performance. In this paper, we propose a supervised β-VAE model for AED, which adds a novel event-specific disentangling loss in the objective function of disentangled learning. By incorporating either latent factor blocks or latent… 
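As a rough illustration of the kind of objective the abstract describes, the sketch below combines the standard β-VAE terms (a reconstruction error plus a β-weighted KL divergence to a unit Gaussian prior) with one extra supervised term. The exact form of the paper's event-specific disentangling loss is not given in the abstract, so it appears here only as a precomputed scalar `event_loss`. The names `supervised_beta_vae_loss`, `event_loss`, and `lam`, and the use of squared error for reconstruction, are assumptions made for illustration and are not taken from the paper.

```python
import math


def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def squared_error(x, x_recon):
    """Sum of squared differences, assumed here as the reconstruction term."""
    return sum((a - b) ** 2 for a, b in zip(x, x_recon))


def supervised_beta_vae_loss(x, x_recon, mu, logvar,
                             beta=4.0, event_loss=0.0, lam=1.0):
    """Reconstruction + beta * KL + lam * (hypothetical supervised term).

    event_loss stands in for an event-specific disentangling loss whose
    actual definition is not specified in the abstract.
    """
    return (squared_error(x, x_recon)
            + beta * gaussian_kl(mu, logvar)
            + lam * event_loss)
```

With `beta=1` and `event_loss=0` this reduces to an ordinary VAE objective under a Gaussian likelihood; larger `beta` values put more weight on matching the prior, which is the mechanism β-VAE uses to encourage disentangled latent factors.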


Reproducibility Companion Paper: On Learning Disentangled Representation for Acoustic Event Detection
This companion paper is provided to describe the major experiments reported in our paper "On Learning Disentangled Representation for Acoustic Event Detection" published in ACM Multimedia 2019. To
Learning to disentangle emotion factors for facial expression recognition in the wild
This paper proposes an end-to-end module that disentangles latent emotion-discriminative factors from complex factor variables for facial expression recognition (FER) to obtain salient emotion features, and shows that the approach performs better in complex scenes than current state-of-the-art methods.
Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding
This paper provides evidence that objects in segmented natural movies undergo transitions that are typically small in magnitude with occasional large jumps, which is characteristic of a temporally sparse distribution. It also presents SlowVAE, a model for unsupervised representation learning that uses a sparse prior on temporally adjacent observations to disentangle generative factors without any assumptions on the number of changing factors.


A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification
This paper trains two variants of SoundNet, a deep convolutional network that takes the audio tracks of videos as input and tries to approximate the visual information extracted by an image recognition network, in order to introduce knowledge learned from a much larger corpus into the CTC network.
Virtual Adversarial Training and Data Augmentation for Acoustic Event Detection with Gated Recurrent Neural Networks
Data augmentation, such as on-the-fly shuffling, and virtual adversarial training for regularization improve the performance of gated recurrent neural networks (GRNNs) in the acoustic event detection task.
Recurrent neural networks for polyphonic sound event detection in real life recordings
In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs). A single
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
This paper presents the setup of the DCASE 2017 challenge tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset.
Deep Neural Network Bottleneck Features for Acoustic Event Recognition
This paper proposes a novel acoustic event recognition framework using bottleneck features derived from a Deep Neural Network (DNN), and employs rhythm, timbre, and spectrum-statistics features for effectively extracting acoustic characteristics from audio signals.
Disentangled Sequential Autoencoder
Empirical evidence is given for the hypothesis that stochastic RNNs as latent state models are more efficient at compressing and generating long sequences than deterministic ones, which may be relevant for applications in video compression.
Frame-Wise Dynamic Threshold Based Polyphonic Acoustic Event Detection
This work proposes two novel dynamic threshold approaches, one contour-based and one regressor-based, and demonstrates their superior performance on the popular TUT Acoustic Scenes 2016 database of polyphonic events.
Polyphonic sound event detection using multi label deep neural networks
Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi-label classification, and the proposed method improves the accuracy by 19 percentage points overall.
beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
Learning an interpretable factorised representation of the independent data generative factors of the world without supervision is an important precursor for the development of artificial
A report on sound event detection with different binaural features
Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset, and are seen to consistently perform equal to or better than single-channel features with respect to the error rate metric.