Staged Training Strategy and Multi-Activation for Audio Tagging with Noisy and Sparse Multi-Label Data

@article{He2020StagedTS,
  title={Staged Training Strategy and Multi-Activation for Audio Tagging with Noisy and Sparse Multi-Label Data},
  author={Ke-Xin He and Yu-Han Shen and Weiqiang Zhang and Jia Liu},
  journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={631-635}
}
Audio tagging aims to predict whether certain acoustic events occur in an audio clip. Because manually labeled data with high confidence is difficult and costly to obtain, researchers have begun to focus on audio tagging with a small set of manually-labeled data and a larger set of noisy-labeled data. In addition, audio tagging is a sparse multi-label classification task, in which only a small number of acoustic events may occur in any given audio clip. In this paper, we propose a staged training…
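
The abstract is truncated above, but its setup (a small trusted set plus a large noisy set) suggests a two-stage schedule: pre-train on the noisy labels, then fine-tune on the verified ones. The sketch below illustrates that pattern in PyTorch; the model, data loaders, epoch counts, and learning rates are assumptions for illustration, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

def staged_training(model, noisy_loader, clean_loader,
                    noisy_epochs=30, clean_epochs=10, device="cpu"):
    """Illustrative two-stage schedule: learn coarse representations from
    the large noisy-labeled set, then fine-tune on the small trusted set.
    (Hypothetical sketch; the paper's actual schedule may differ.)"""
    criterion = nn.BCEWithLogitsLoss()  # multi-label targets in {0,1}^C
    model.to(device)

    # Stage 1: pre-train on the large noisy-labeled set.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(noisy_epochs):
        for x, y in noisy_loader:
            opt.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()

    # Stage 2: fine-tune on the small manually-labeled set, lower LR.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(clean_epochs):
        for x, y in clean_loader:
            opt.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    return model
```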

References

SHOWING 1-10 OF 20 REFERENCES
Multiple Neural Networks with Ensemble Method for Audio Tagging with Noisy Labels and Minimal Supervision
This system uses a sigmoid-softmax activation to handle so-called sparse multi-label classification, together with an ensemble method that averages models trained with multiple neural networks and various acoustic features, to achieve high label-weighted label-ranking average precision (lwlrap) scores.
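
The summary names the technique but not its form; one plausible reading of a "sigmoid-softmax" activation, sketched below, blends per-class sigmoid scores (standard multi-label) with a softmax over classes (which favors sparse predictions). The mixing weight `alpha` is a hypothetical parameter, not taken from the cited paper.

```python
import torch

def sigmoid_softmax(logits, alpha=0.5):
    """Blend independent per-class sigmoid scores with a softmax over
    classes; the softmax term suppresses all but a few classes, matching
    the sparse multi-label prior. alpha is a hypothetical mixing weight."""
    sig = torch.sigmoid(logits)            # independent per-class scores
    soft = torch.softmax(logits, dim=-1)   # competition across classes
    return alpha * sig + (1.0 - alpha) * soft

scores = sigmoid_softmax(torch.randn(4, 80))  # batch of 4 clips, 80 classes
```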
Audio tagging with noisy labels and minimal supervision
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
A weakly supervised method that not only predicts the tags but also indicates the temporal locations of the detected acoustic events; the attention scheme is found to be effective at identifying the important frames while ignoring unrelated frames.
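
A common realization of such an attention scheme is weighted temporal pooling: the network scores every frame per class and aggregates frame-level probabilities with those weights, so unrelated frames contribute little and the weights localize events in time. The module below is a generic attention-pooling sketch, not the cited paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Generic attention pooling over time for weakly supervised tagging
    (illustrative; layer shapes are assumptions)."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)  # frame-level classifier
        self.att = nn.Linear(feat_dim, n_classes)  # frame-level attention

    def forward(self, h):                      # h: (batch, time, feat_dim)
        p = torch.sigmoid(self.cla(h))         # per-frame class probabilities
        w = torch.softmax(self.att(h), dim=1)  # attention weights over time
        clip_prob = (w * p).sum(dim=1)         # weighted temporal average
        return clip_prob, w                    # w localizes events in time
```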
MULTITASK LEARNING AND SEMI-SUPERVISED LEARNING WITH NOISY DATA FOR AUDIO TAGGING Technical Report
Describes the authors' submission to DCASE 2019 Challenge Task 2, "Audio tagging with noisy labels and minimal supervision", which achieves a label-weighted label-ranking average precision (lwlrap) of 0.750.
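
lwlrap is the official metric of that task: for every positive (clip, label) pair, it measures the precision of retrieving that label within the clip's ranked score list, weighting each label by its share of all positives, which reduces to a mean over all positive pairs. A compact NumPy sketch of that definition:

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision (per the DCASE 2019
    Task 2 definition). truth: (n_clips, n_classes) in {0, 1};
    scores: same shape, real-valued."""
    n_classes = truth.shape[1]
    ranks = np.arange(1, n_classes + 1)
    precisions = []                    # one value per positive (clip, label)
    for t, s in zip(truth, scores):
        order = np.argsort(-s)                 # class indices, best first
        hits = t[order].astype(bool)           # which ranks hold true labels
        cum_hits = np.cumsum(hits)
        precisions.extend((cum_hits[hits] / ranks[hits]).tolist())
    return float(np.mean(precisions)) if precisions else 0.0
```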
General-purpose audio tagging from noisy labels using convolutional neural networks
Proposes a system using an ensemble of convolutional neural networks trained on log-scaled mel spectrograms to address general-purpose audio tagging and to reduce the effects of label noise.
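
Log-scaled mel spectrograms are the de-facto input representation for such CNNs; a typical extraction with librosa looks like the following (the sample rate, FFT size, hop length, and mel-band count are common defaults, not necessarily this paper's settings):

```python
import librosa
import numpy as np

def log_mel(path, sr=44100, n_fft=2048, hop_length=512, n_mels=64):
    """Compute a log-scaled mel spectrogram, a common CNN front end for
    audio tagging. Parameter choices here are illustrative."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length,
                                         n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # (n_mels, n_frames), dB
```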
Convolutional Recurrent Neural Network and Data Augmentation for Audio Tagging with Noisy Labels and Minimal Supervision
This paper proposes a model consisting of a convolutional front end using log-mel energies as input features, a recurrent neural network sequence encoder, and a fully connected classifier network outputting an activity probability for each of the 80 considered event classes.
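
That description maps naturally onto a small CRNN. The sketch below follows the stated structure (CNN front end on log-mel input, recurrent encoder, fully connected classifier over 80 classes); the specific layer sizes, pooling, and max-over-time aggregation are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN in the spirit of the described system; layer sizes
    and temporal aggregation are illustrative assumptions."""
    def __init__(self, n_mels=64, n_classes=80, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),              # pool frequency, keep time
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, 1, n_mels, time)
        z = self.conv(x)                        # (batch, 64, n_mels//4, time)
        z = z.permute(0, 3, 1, 2).flatten(2)    # (batch, time, features)
        z, _ = self.rnn(z)                      # sequence encoding
        frame_prob = torch.sigmoid(self.fc(z))  # per-frame class activity
        return frame_prob.amax(dim=1)           # clip-level probabilities
```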
Audio Event Detection using Weakly Labeled Data
It is shown that audio event detection using weak labels can be formulated as a multiple-instance learning (MIL) problem, and two frameworks for solving it are suggested: one based on support vector machines, the other on neural networks.
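
In the MIL view, a clip is a bag of segments (instances), and the bag is positive for a class if at least one instance is. For the neural-network framework, a minimal reading scores each segment and max-pools over the bag; the head below is an illustrative sketch under that assumption.

```python
import torch
import torch.nn as nn

class MILMaxHead(nn.Module):
    """Multiple-instance learning head: score each segment (instance) of a
    clip (bag), then take the max over segments, so the bag is positive
    whenever any instance is. Illustrative sketch."""
    def __init__(self, feat_dim=128, n_classes=10):
        super().__init__()
        self.instance_scorer = nn.Linear(feat_dim, n_classes)

    def forward(self, segments):        # (batch, n_segments, feat_dim)
        inst = torch.sigmoid(self.instance_scorer(segments))
        return inst.amax(dim=1)         # bag-level (clip-level) scores
```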
Training general-purpose audio tagging networks with noisy labels and iterative self-verification
This paper describes our submission to the first Freesound general-purpose audio tagging challenge, carried out within the DCASE 2018 challenge. Our proposal is based on a fully convolutional neural…
Audio Set: An ontology and human-labeled dataset for audio events
Describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients), achieving state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks and outperforming all prior work.
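
SpecAugment's masking operations are cheap transforms on the feature matrix itself. Below is a minimal NumPy version of its frequency and time masking (the original paper's time warping is omitted; the mask counts and widths F and T are illustrative knobs):

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, F=8, n_time_masks=2, T=20, rng=None):
    """Minimal SpecAugment-style masking on a (n_mels, n_frames) log-mel
    spectrogram: zero out random frequency bands and time spans.
    Time warping from the original method is omitted for brevity."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)                  # mask height in mel bins
        f0 = rng.integers(0, max(1, n_mels - f))    # mask start bin
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)                  # mask width in frames
        t0 = rng.integers(0, max(1, n_frames - t))  # mask start frame
        out[:, t0:t0 + t] = 0.0
    return out
```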