Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging

Yong Xu, Qiuqiang Kong, Qiang Huang, Wenwu Wang and Mark D. Plumbley
Audio tagging aims to perform multi-label classification on audio chunks; it was newly proposed as a task in the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. The task encourages research efforts to better analyze and understand the content of the huge amounts of audio data on the web. The difficulty in audio tagging is that only chunk-level labels are available, with no frame-level labels. This paper presents a weakly supervised method to not only…
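The chunk-level-only labels described above are commonly handled with attention pooling: frame-level class scores are aggregated into a single chunk-level prediction via learned attention weights, so the model can be trained from chunk labels alone. A minimal NumPy sketch of this aggregation step (illustrative only, not the paper's exact architecture; shapes and names are assumptions):

```python
import numpy as np

def attention_pooling(frame_probs, attention_logits):
    """Aggregate frame-level class probabilities into one chunk-level
    prediction using softmax attention weights over time.

    frame_probs:      (T, C) per-frame class probabilities
    attention_logits: (T, C) unnormalized per-frame attention scores
    """
    # Softmax over the time axis yields per-class attention weights
    w = np.exp(attention_logits - attention_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    # Weighted average over frames gives the chunk-level score per class
    return (w * frame_probs).sum(axis=0)  # shape (C,)

rng = np.random.default_rng(0)
T, C = 240, 7  # e.g. 240 frames, 7 tags (hypothetical sizes)
probs = rng.random((T, C))
logits = rng.normal(size=(T, C))
chunk = attention_pooling(probs, logits)
```

Because the output is a convex combination of per-frame probabilities, the attention weights also indicate which frames drove each tag, which is what enables localization from weak labels.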

A Comparison of Attention Mechanisms of Convolutional Neural Network in Weakly Labeled Audio Tagging

The results show that attention based on gated linear units (GLU) performs better than the squeeze-and-excitation (SE) block in a CRNN for weakly labeled polyphonic audio tagging.
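The GLU compared above multiplies a linear transform of the input by a sigmoid gate, so the gate acts as a learned attention mask over features. A minimal NumPy sketch (dense rather than convolutional, with assumed shapes, to show only the gating mechanism):

```python
import numpy as np

def glu(x, w_lin, w_gate):
    """Gated linear unit: a linear path modulated elementwise by a
    sigmoid gate in (0, 1).

    x: (N, D) input; w_lin, w_gate: (D, D) weight matrices.
    """
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # sigmoid gate
    return (x @ w_lin) * gate  # gate suppresses or passes each feature

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
w_l = rng.normal(size=(8, 8))
w_g = rng.normal(size=(8, 8))
out = glu(x, w_l, w_g)
```

Because the gate lies strictly between 0 and 1, each output feature can only attenuate the corresponding linear activation, which is how the GLU realizes an attention-like selection.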

A Region Based Attention Method for Weakly Supervised Sound Event Detection and Classification

A novel region-based attention method is proposed to further boost the representation power of the existing GLU-based CRNN: region features are extracted from multi-scale sliding windows over the higher convolutional layers and fed into an attention-based recurrent neural network.

Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network

In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won first place in the DCASE 2017 challenge task on large-scale weakly supervised sound event detection.

Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention

A novel attention mechanism, namely spatial and channel-wise attention (SCA), is proposed; it can be employed in any CNN seamlessly with affordable overhead and is trainable in an end-to-end fashion.

Staged Training Strategy and Multi-Activation for Audio Tagging with Noisy and Sparse Multi-Label Data

This paper proposes a staged training strategy to deal with noisy labels, and adopts a sigmoid-sparsemax multi-activation structure to deal with the sparse multi-label classification of audio tagging.
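The sparsemax activation mentioned above (Martins & Astudillo, 2016) is a softmax variant that can assign exactly zero probability to low-scoring classes, which suits sparse multi-label targets. A minimal NumPy sketch of the 1-D sparsemax projection (illustrative; not the paper's full sigmoid-sparsemax stack):

```python
import numpy as np

def sparsemax(z):
    """Project logits z onto the probability simplex via sparsemax:
    output = max(z - tau, 0), with tau chosen so the result sums to 1.
    Unlike softmax, low-scoring entries can be exactly zero.
    """
    z_sorted = np.sort(z)[::-1]           # descending
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # Support set: indices where the threshold condition holds
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]                  # size of the support
    tau = (cumsum[support][-1] - 1) / k_z # threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.0, 0.1]))  # → [1. 0. 0.]
```

Here the gap between the top logit and the rest is large enough that all mass collapses onto one class; closer logits would share mass while still zeroing out clear non-members.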

Multi-Level Fusion based Class-aware Attention Model for Weakly Labeled Audio Tagging

A novel end-to-end multi-level attention model is presented: it first makes segment-level predictions with temporal modeling, followed by advanced aggregations along both the time and feature domains, and introduces a weight-sharing strategy to reduce model complexity and overfitting.

Cross-scale Attention Model for Acoustic Event Classification

A cross-scale attention (CSA) model, which explicitly integrates features from different scales to form the final representation, is proposed and shown to effectively improve the performance of current state-of-the-art deep learning algorithms.

Task-Aware Mean Teacher Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection

A task-aware mean teacher method using a convolutional recurrent neural network (CRNN) with a multi-branch structure is proposed to handle the sound event detection (SED) and audio tagging (AT) tasks differently, with results demonstrating the superiority of the proposed method on the DCASE 2018 challenge.

Reinforcement Learning based Neural Architecture Search for Audio Tagging

This paper proposes to use the convolutional recurrent neural network with Attention and Location (ATT-LOC) as the audio tagging model, and applies neural architecture search to find the optimal number of filters and the filter size.

Sample Mixed-Based Data Augmentation for Domestic Audio Tagging

A convolutional recurrent neural network with an attention module, taking log-scaled mel spectra as input, is applied to audio tagging; with the mixup approach it achieves a state-of-the-art equal error rate (EER) of 0.10 on the DCASE 2016 Task 4 dataset, outperforming the baseline system without data augmentation.
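The mixup augmentation referenced above convex-combines two training examples and their multi-hot tag vectors with a Beta-distributed weight. A minimal NumPy sketch (parameter names and the alpha value are illustrative assumptions):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Sample-mixed augmentation: blend two audio examples and their
    multi-hot tag vectors with a weight lam ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # lam in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2  # soft multi-label target
    return x, y

rng = np.random.default_rng(0)
x1, y1 = np.zeros(16), np.array([1.0, 0.0])  # toy feature vectors
x2, y2 = np.ones(16), np.array([0.0, 1.0])
xm, ym = mixup(x1, y1, x2, y2, rng=rng)
```

Small alpha values concentrate lam near 0 or 1, so most mixed samples stay close to one of the originals while still smoothing the decision boundary.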

Fully DNN-based Multi-label regression for audio tagging

This paper proposes to use a fully deep neural network (DNN) framework to handle the multi-label classification task as a regression problem, and shows that the approach obtained a 15% relative improvement over the official GMM-based method of the DCASE 2016 challenge.

Convolutional gated recurrent neural network incorporating spatial features for audio tagging

This paper proposes to use a convolutional neural network (CNN) to extract robust features from mel-filter banks, spectrograms or even raw waveforms for audio tagging, and evaluates the proposed methods on Task 4 of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge.

Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging

A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning is proposed to handle the multi-label classification task, with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) generating new data-driven features from the logarithmic mel-filter bank features.

A joint detection-classification model for audio tagging of weakly labelled data

This work proposes a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously and shows that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%.

Audio Event Detection using Weakly Labeled Data

It is shown that audio event detection using weak labels can be formulated as a multiple instance learning (MIL) problem, and two frameworks for solving it are suggested, one based on support vector machines and the other on neural networks.
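In the MIL formulation above, each audio recording is a bag of frame-level instances, and the bag is positive for a tag if at least one instance is positive; a common realization scores the bag by max-pooling instance scores. A minimal NumPy sketch:

```python
import numpy as np

def mil_bag_score(instance_scores):
    """Multiple-instance pooling: score a bag (recording) by its most
    confident instance (frame) per class.

    instance_scores: (T, C) per-frame, per-class scores.
    Returns: (C,) bag-level scores.
    """
    return instance_scores.max(axis=0)

scores = np.array([[0.1, 0.9],
                   [0.3, 0.2],
                   [0.8, 0.1]])
print(mil_bag_score(scores))  # → [0.8 0.9]
```

Max pooling directly encodes the "at least one positive instance" assumption; softer pooling functions (mean, attention) trade that sharpness for smoother gradients.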

CQT-based Convolutional Neural Networks for Audio Scene Classification

It is shown in this paper that a constant-Q-transformed input to a convolutional neural network improves results, and a parallel (graph-based) neural network architecture is proposed which captures relevant audio characteristics both in time and in frequency.

Deep Neural Network Baseline for DCASE Challenge 2016

The DCASE Challenge 2016 contains tasks for acoustic scene classification (ASC), acoustic event detection (AED), and audio tagging; the presented DNN baselines indicate that DNNs can be successful in many of these tasks, but may not always outperform the official baselines.

The use of convolutional neural networks (CNN) to label audio signals recorded in a domestic (home) environment is investigated, and a relative 23.8% improvement over the Gaussian mixture model (GMM) baseline method is observed on the development dataset for the challenge.

Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

Attention-Based Models for Speech Recognition

The attention mechanism is extended with features needed for speech recognition, and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rates.