Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

@article{Fonseca2020AddressingML,
  title={Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking},
  author={Eduardo Fonseca and Shawn Hershey and Manoj Plakal and Daniel P. W. Ellis and Aren Jansen and R. Channing Moore},
  journal={IEEE Signal Processing Letters},
  year={2020},
  volume={27},
  pages={1235--1239}
}
The study of label noise in sound event recognition has recently gained attention with the advent of larger and noisier datasets. This work addresses the problem of missing labels, one of the big weaknesses of large audio datasets, and one of the most conspicuous issues for AudioSet. We propose a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the… 

Citations

Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition
TLDR
A VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects is created to support research on building robust and accurate vocal sound recognition.
Semi-Supervised Audio Classification with Partially Labeled Data
TLDR
This paper presents two semi-supervised methods capable of learning with missing labels and evaluates them on two publicly available, partially labeled datasets.
PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation
TLDR
PSLA is presented, a collection of training techniques that can noticeably boost the model accuracy including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, model aggregation and their design choices that achieves a new state-of-the-art mean average precision on AudioSet.
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation
TLDR
PSLA is presented, a collection of model-agnostic training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation.
FSD50K: An Open Dataset of Human-Labeled Sound Events
TLDR
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 hours of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster SER research.
CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
TLDR
An intriguing interaction is found between the two very different models: CNN and AST models are good teachers for each other, and when either of them is used as the teacher and the other model is trained as the student via knowledge distillation, the performance of the student model noticeably improves, and in many cases is better than that of the teacher model.
Enriching Ontology with Temporal Commonsense for Low-Resource Audio Tagging
TLDR
This work investigates robust audio tagging models in low-resource scenarios with the enhancement of knowledge graphs and proposes a semi-automatic approach that can construct temporal knowledge graphs on diverse domain-specific label sets.
Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks
TLDR
This paper evaluates two pooling methods to improve shift invariance in CNNs, based on low-pass filtering and adaptive sampling of incoming feature maps, and shows that these modifications consistently improve sound event classification in all cases considered, without adding any (or adding very few) trainable parameters, which makes them an appealing alternative to conventional pooling layers.
Self-Supervised Learning from Automatically Separated Sound Scenes
TLDR
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning and finds that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone.

References

Showing 1-10 of 30 references
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
TLDR
This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Confident Learning: Estimating Uncertainty in Dataset Labels
TLDR
Building on the assumption of a classification noise process, this work directly estimates the joint distribution between noisy (given) labels and uncorrupted (unknown) labels, resulting in a generalized CL which is provably consistent and experimentally performant.
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
TLDR
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
A Deep Residual Network for Large-Scale Acoustic Scene Analysis
TLDR
The task of training a multi-label event classifier directly from the audio recordings of AudioSet is studied and it is found that the models are able to localize audio events when a finer time resolution is needed.
Audio tagging with noisy labels and minimal supervision
TLDR
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
Learning Sound Event Classifiers from Web Audio with Noisy Labels
TLDR
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.
Learning Sound Events From Webly Labeled Data
TLDR
This work introduces webly labeled learning for sound events which aims to remove human supervision altogether from the learning process, and develops a method of obtaining labeled audio data from the web, in which no manual labeling is involved.
Model-Agnostic Approaches To Handling Noisy Labels When Training Sound Event Classifiers
TLDR
This work evaluates simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup and noise-robust loss functions, which can be easily incorporated to existing deep learning pipelines without need for network modifications or extra resources.
SeCoST: Sequential Co-Supervision for Weakly Labeled Audio Event Detection