Weakly Labelled AudioSet Tagging With Attention Neural Networks

Qiuqiang Kong, Changsong Yu, Yong Xu, Turab Iqbal, Wenwu Wang, Mark D. Plumbley
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognizing a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. AudioSet is weakly labelled, in that only the presence or absence of sound classes is known for each clip, whereas the onset and offset times are unknown… 
A Global-Local Attention Framework for Weakly Labelled Audio Tagging
  • Helin Wang, Yuexian Zou, Wenwu Wang
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
A novel two-stream framework for audio tagging is proposed that exploits the global and local information of sound events and can significantly improve audio tagging performance under different baseline network architectures.
Audio Captioning Transformer
An Audio Captioning Transformer (ACT) is proposed: a full Transformer network based on an encoder-decoder architecture that is entirely convolution-free and is better able to model the global information within an audio signal and to capture temporal relationships between audio events.
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
This work aims to study the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments, selected from the Google AudioSet dataset.
Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization
A convolutional neural network transformer (CNN-Transformer) is proposed for audio tagging and sound event detection (SED), and it is shown to perform similarly to a convolutional recurrent neural network (CRNN).
PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation
PSLA is presented, a collection of training techniques that noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation, together with their design choices; it achieves a new state-of-the-art mean average precision on AudioSet.
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation
PSLA is presented, a collection of model-agnostic training techniques that noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation.
Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention
A novel attention mechanism, namely spatial and channel-wise attention (SCA), is proposed; it can be plugged into any CNN seamlessly with affordable overhead and is end-to-end trainable.
Modeling Label Dependencies for Audio Tagging With Graph Convolutional Network
This work proposes to model the label dependencies via a graph-based method, where each node of the graph represents a label, and achieves a state-of-the-art mean average precision (mAP) of 0.434.
SeCoST: Sequential Co-Supervision for Large Scale Weakly Labeled Audio Event Detection
  • Anurag Kumar, V. Ithapu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A new framework for designing learning models with weak supervision by bridging ideas from sequential learning and knowledge distillation is proposed, referred to as SeCoST (pronounced Sequest) — Sequential Co-supervision for training generations of Students.


Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly
Multi-level Attention Model for Weakly Supervised Audio Classification
A multi-level attention model is proposed, consisting of multiple attention modules applied to intermediate neural network layers; it achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single attention model and the Google baseline system.
A joint detection-classification model for audio tagging of weakly labelled data
This work proposes a joint detection-classification (JDC) model to detect and classify the audio clip simultaneously and shows that the JDC model reduces the equal error rate (EER) from 19.0% to 16.9%.
Audio Set Classification with Attention Model: A Probabilistic Perspective
This paper investigates the Audio Set classification. Audio Set is a large scale weakly labelled dataset (WLD) of audio clips. In WLD only the presence of a label is known, without knowing the
Learning to Recognize Transient Sound Events using Attentional Supervision
This paper presents an attempt to learn a neural network model that recognizes more than 500 different sound events from the audio part of user generated videos (UGV), establishing a new state-of-the-art for DCASE17 and AudioSet data sets.
Audio Event Detection using Weakly Labeled Data
It is shown that audio event detection using weak labels can be formulated as a multiple instance learning problem, and two frameworks for solving multiple instance learning are suggested, one based on support vector machines and the other on neural networks.
A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling
This paper builds a neural network called TALNet, which is the first system to reach state-of-the-art audio tagging performance on Audio Set, while exhibiting strong localization performance on the DCASE 2017 challenge at the same time.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Bag-of-Audio-Words Approach for Multimedia Event Classification
Variations of the BoAW method are explored, and results on the NIST 2011 multimedia event detection (MED) dataset are presented.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.