• Corpus ID: 3746911

Multi-level Attention Model for Weakly Supervised Audio Classification

  • Changsong Yu, Karim Said Barsim, Qiuqiang Kong, Binh Yang
In this paper, we propose a multi-level attention model for the weakly labelled audio classification problem. The objective of audio classification is to predict the presence or absence of sound events in an audio clip. Recently, Google published AudioSet, a large-scale weakly labelled dataset containing 2 million audio clips labelled only with the presence or absence of sound events, without their onset and offset times. Previously proposed attention models…

Weakly Labelled AudioSet Tagging With Attention Neural Networks
This work bridges the connection between attention neural networks and multiple instance learning (MIL) methods, and proposes decision-level and feature-level attention neural networks for audio tagging, which achieve a state-of-the-art mean average precision.
A Global-Local Attention Framework for Weakly Labelled Audio Tagging
  • Helin Wang, Yuexian Zou, Wenwu Wang
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
A novel two-stream framework for audio tagging that exploits the global and local information of sound events and can significantly improve the performance of audio tagging under different baseline network architectures.
Cross-scale Attention Model for Acoustic Event Classification
A cross-scale attention (CSA) model is proposed that explicitly integrates features from different scales to form the final representation and can effectively improve the performance of current state-of-the-art deep learning algorithms.
Multi-Level Fusion based Class-aware Attention Model for Weakly Labeled Audio Tagging
A novel end-to-end multi-level attention model is presented that first makes segment-level predictions with temporal modeling, then applies advanced aggregations along both the time and feature domains, and introduces a weight-sharing strategy to reduce model complexity and overfitting.
A Deep Residual Network for Large-Scale Acoustic Scene Analysis
The task of training a multi-label event classifier directly from the audio recordings of AudioSet is studied and it is found that the models are able to localize audio events when a finer time resolution is needed.
Deep progressive multi-scale attention for acoustic event classification
The proposed MSA model encodes multi-scale features with local and global discriminative structures, which are expected to improve performance, and it effectively improves on current state-of-the-art deep learning algorithms.
Gated Multi-Head Attention Pooling for Weakly Labelled Audio Tagging
A novel pooling algorithm is proposed for MIL, named gated multi-head attention pooling (GMAP), which is able to attend to the information of events from different heads at different positions, and increases the modeling power of single-head attention with no computational overhead.
An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training Strategy
An end-to-end audio classification system based on raw waveforms and a mix-training strategy that breaks the performance limitation caused by the amount of training data and exceeds the state-of-the-art multi-level attention model.
A Multi-Channel Temporal Attention Convolutional Neural Network Model for Environmental Sound Classification
An effective convolutional neural network structure with a multichannel temporal attention (MCTA) block, which applies a temporal attention mechanism within each channel of the embedded features to extract channel-wise relevant temporal information.
Learning Hierarchy Aware Embedding From Raw Audio for Acoustic Scene Classification
  • V. Abrol, Pulkit Sharma
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2020
This work proposes a raw-waveform-based end-to-end ASC system using a convolutional neural network that leverages the hierarchical relations between acoustic categories to improve the classification performance and uses a prototypical model.


Audio Set Classification with Attention Model: A Probabilistic Perspective
This paper investigates the Audio Set classification. Audio Set is a large scale weakly labelled dataset (WLD) of audio clips. In WLD only the presence of a label is known, without knowing the…
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly…
Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging
The experiments show that using the combination of multi-level and multi-scale features is highly effective in music auto-tagging and the proposed method outperforms the previous state-of-the-art methods on the MagnaTagATune dataset and the Million Song Dataset.
Audio Event Detection using Weakly Labeled Data
It is shown that audio event detection using weak labels can be formulated as a multiple instance learning problem, and two frameworks for solving multiple instance learning are suggested, one based on support vector machines and the other on neural networks.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
A Multi-level Weighted Representation for Person Re-identification
A multi-level weighted representation for person re-identification is proposed, in which features containing strong discriminative powers or rich semantic meanings are extracted from different layers of a deep CNN, and an estimation subnet evaluates the quality of each feature and generates quality scores used as concatenation weights for all multi-level features.
Deep Neural Network Baseline for DCASE Challenge 2016
The DCASE Challenge 2016 contains tasks for Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), and audio tagging, and DNN baselines indicate that DNNs can be successful in many of these tasks, but may not always perform better than the baselines.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset.
Automatic Tagging Using Deep Convolutional Neural Networks
The experiments show that mel-spectrogram is an effective time-frequency representation for automatic tagging and that more complex models benefit from more training data.