Audio Tagging by Cross Filtering Noisy Labels

@article{Zhu2020AudioTB,
  title={Audio Tagging by Cross Filtering Noisy Labels},
  author={Boqing Zhu and Kele Xu and Qiuqiang Kong and Huaimin Wang and Yuxing Peng},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2020},
  volume={28},
  pages={2073-2083}
}
  • Published 16 July 2020 in IEEE/ACM Transactions on Audio, Speech, and Language Processing
High quality labeled datasets have allowed deep learning to achieve impressive results on many sound analysis tasks. Yet, it is labor-intensive to accurately annotate large amounts of audio data, and in practical settings datasets may contain noisy labels. Meanwhile, deep neural networks are susceptible to such incorrectly labeled data because of their strong memorization ability. In this article, we present a novel framework, named CrossFilter, to combat the noisy labels problem… 
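The abstract is cut off before it describes the method. Purely to illustrate the general idea of filtering noisy labels by cross checking two models' views of the data (a co-teaching-style small-loss heuristic, not the paper's actual CrossFilter procedure; all names and the keep ratio below are assumptions), a sketch:

```python
import torch
import torch.nn.functional as F

def cross_filter_step(model_a, model_b, x, y, keep_ratio=0.7):
    """Hedged sketch of cross filtering: each model keeps the samples the
    *other* model finds easy (small loss), treating the rest as potentially
    mislabeled. Illustrative only; not the exact CrossFilter algorithm."""
    loss_a = F.cross_entropy(model_a(x), y, reduction="none")
    loss_b = F.cross_entropy(model_b(x), y, reduction="none")
    k = int(keep_ratio * x.size(0))
    idx_for_a = torch.argsort(loss_b)[:k]  # model B selects likely-clean samples for A
    idx_for_b = torch.argsort(loss_a)[:k]  # model A selects likely-clean samples for B
    return loss_a[idx_for_a].mean(), loss_b[idx_for_b].mean()
```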
ARCA23K: An audio dataset for investigating open-set label noise
TLDR
It is shown that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and this type of label noise is referred to as open-set label noise.
Polyphonic training set synthesis improves self-supervised urban sound classification.
TLDR
A two-stage approach that pre-trains audio classifiers on a task whose ground truth is trivially available is found to benefit overall performance more than self-supervised learning, and the geographical origin of the acoustic events used in training set synthesis appears to have a decisive impact.
Audio Tagging Using CNN Based Audio Neural Networks for Massive Data Processing
TLDR
A large-scale audio dataset is used to pre-train an audio neural network that outperforms existing systems with a mean average precision of 0.45, and the performance of the proposed model is demonstrated by applying the audio neural network to five specific audio pattern recognition tasks.
Self-Supervised Learning from Automatically Separated Sound Scenes
TLDR
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning and finds that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone.
Multimodal Deep Learning for Social Media Popularity Prediction With Attention Mechanism
TLDR
A novel multimodal deep learning framework for the popularity prediction task, which aims to leverage complementary knowledge from different modalities, is proposed; results show that the proposed framework outperforms related approaches.
Multi-Scale Generalized Attention-Based Regional Maximum Activation of Convolutions for Beauty Product Retrieval
TLDR
This paper proposes a novel descriptor, named Multi-Scale Generalized Attention-Based Regional Maximum Activation of Convolutions (MS-GRMAC), which introduces a multi-scale generalized attention mechanism to reduce the influence of scale variations and can thus boost the performance of the retrieval task.

References

SHOWING 1-10 OF 52 REFERENCES
Audio tagging with noisy labels and minimal supervision
TLDR
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
General-purpose audio tagging from noisy labels using convolutional neural networks
TLDR
A system is proposed that uses an ensemble of convolutional neural networks trained on log-scaled mel spectrograms to address general-purpose audio tagging and to reduce the effects of label noise.
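For reference, the log-scaled mel spectrogram input mentioned in this TLDR can be computed with librosa as below; the sample rate, FFT size, hop length, and number of mel bands here are illustrative defaults, not the parameters of the cited ensemble:

```python
import librosa
import numpy as np

# Load a clip and compute a log-scaled mel spectrogram as CNN input.
# All parameter values are illustrative, not those of the cited system.
y, sr = librosa.load("clip.wav", sr=32000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=320, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)
```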
Learning Sound Event Classifiers from Web Audio with Noisy Labels
TLDR
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.
Iterative Learning with Open-set Noisy Labels
TLDR
A novel iterative learning framework for training CNNs on datasets with open-set noisy labels that detects noisy labels and learns deep discriminative features in an iterative fashion and designs a Siamese network to encourage clean labels and noisy labels to be dissimilar.
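A generic contrastive loss of the kind a Siamese network would use to push clean and noisy samples apart in feature space is sketched below; this is the standard formulation, not necessarily the exact loss of the cited paper, and the margin value is an assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_1, feat_2, same_group, margin=1.0):
    """Standard Siamese contrastive loss: pairs from the same group
    (e.g. both clean) are pulled together, pairs from different groups
    (clean vs. noisy) are pushed at least `margin` apart. `same_group`
    is a 0/1 float tensor supplied by whatever filtering stage labels pairs."""
    d = F.pairwise_distance(feat_1, feat_2)
    pull = same_group * d.pow(2)
    push = (1.0 - same_group) * F.relu(margin - d).pow(2)
    return (pull + push).mean()
```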
Label-efficient audio classification through multitask learning and self-supervision
TLDR
This work trains an end-to-end audio feature extractor based on WaveNet that feeds into simple, yet versatile task-specific neural networks and describes several easily implemented self-supervised learning tasks that can operate on any large, unlabeled audio corpus.
Learning from Noisy Large-Scale Datasets with Minimal Supervision
TLDR
An approach to effectively use millions of images with noisy annotations in conjunction with a small subset of cleanly-annotated images to learn powerful image representations and is particularly effective for a large number of classes with wide range of noise in annotations.
Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels
TLDR
A theoretically grounded set of noise-robust loss functions that can be seen as a generalization of MAE and CCE are presented and can be readily applied with any existing DNN architecture and algorithm, while yielding good performance in a wide range of noisy label scenarios.
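The generalized cross entropy loss interpolates between categorical cross entropy (as q → 0) and MAE (at q = 1) via L_q(p_y) = (1 − p_y^q) / q, where p_y is the predicted probability of the labeled class. A compact PyTorch version (function and variable names are ours; q = 0.7 is the commonly used default):

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    """L_q loss: (1 - p_y^q) / q. Recovers cross entropy as q -> 0 and
    MAE (up to scale) at q = 1, trading convergence speed for robustness
    to label noise."""
    probs = F.softmax(logits, dim=1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-7)
    return ((1.0 - p_y.pow(q)) / q).mean()
```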
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
TLDR
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning is proposed to handle the multilabel classification task, together with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) that generates new data-driven features from logarithmic Mel-filter bank features.
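A minimal denoising auto-encoder over log Mel-filter bank frames illustrates the feature-learning idea; the layer sizes and noise level below are illustrative, not the syDAE/asyDAE configuration of the cited work:

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Reconstructs clean log Mel-filter bank frames from corrupted ones;
    the bottleneck activations serve as learned features. Sizes are
    illustrative, not the cited syDAE/asyDAE configuration."""
    def __init__(self, n_mels=40, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, 512), nn.ReLU(),
                                     nn.Linear(512, bottleneck), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 512), nn.ReLU(),
                                     nn.Linear(512, n_mels))

    def forward(self, x, noise_std=0.1):
        corrupted = x + noise_std * torch.randn_like(x)
        code = self.encoder(corrupted)
        return self.decoder(code), code  # reconstruction and learned feature
```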
DCASE 2019 Task 2: Multitask Learning, Semi-supervised Learning and Model Ensemble with Noisy Data for Audio Tagging
TLDR
This paper describes the approach to the DCASE 2019 challenge Task 2: Audio tagging with noisy labels and minimal supervision, a multi-label audio classification task with 80 classes, and proposes three strategies, including multitask learning using noisy data and labels that are relabeled using trained models' predictions.
Training Deep Neural Networks on Noisy Labels with Bootstrapping
TLDR
A generic way to handle noisy and incomplete labeling by augmenting the prediction objective with a notion of consistency is proposed, which considers a prediction consistent if the same prediction is made given similar percepts, where the notion of similarity is between deep network features computed from the input data.
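The "soft" bootstrapping objective described here blends the possibly noisy target with the network's own prediction, ℓ = −Σ_k (β y_k + (1 − β) p_k) log p_k, rewarding predictions that stay consistent with themselves. A sketch in PyTorch (the mixing weight β = 0.95 is the commonly cited default; names are ours):

```python
import torch
import torch.nn.functional as F

def soft_bootstrap_loss(logits, targets, beta=0.95):
    """Soft bootstrapping: the training target is a convex combination of
    the (possibly noisy) one-hot label and the model's current prediction."""
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    blended = beta * one_hot + (1.0 - beta) * probs
    return -(blended * log_probs).sum(dim=1).mean()
```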
...