Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

Eduardo Fonseca, Shawn Hershey, Manoj Plakal, Daniel P. W. Ellis, Aren Jansen, R. Channing Moore
IEEE Signal Processing Letters
The study of label noise in sound event recognition has recently gained attention with the advent of larger and noisier datasets. This work addresses the problem of missing labels, one of the big weaknesses of large audio datasets, and one of the most conspicuous issues for AudioSet. We propose a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the… 
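The loss-masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that missing-label candidates are, for each class, the negatively labeled clips that a teacher model scores highest, and that a fixed fraction of them is discarded. The names `build_loss_mask`, `masked_bce`, and `discard_fraction` are hypothetical, and the toy numbers are invented for illustration only.

```python
import math


def build_loss_mask(labels, teacher_scores, discard_fraction):
    """Return a 0/1 mask per (clip, class); 0 means "ignore in the loss".

    Assumption (not taken from the source): for each class, negatively
    labeled clips are ranked by teacher score, and the top
    `discard_fraction` of them are flagged as missing-label candidates.
    """
    num_clips = len(labels)
    num_classes = len(labels[0])
    mask = [[1] * num_classes for _ in range(num_clips)]
    for c in range(num_classes):
        negatives = [i for i in range(num_clips) if labels[i][c] == 0]
        # Highest teacher score first: most likely to be a missing positive.
        negatives.sort(key=lambda i: teacher_scores[i][c], reverse=True)
        num_discard = int(discard_fraction * len(negatives))
        for i in negatives[:num_discard]:
            mask[i][c] = 0
    return mask


def masked_bce(labels, student_probs, mask, eps=1e-7):
    """Mean binary cross-entropy over the entries whose mask value is 1."""
    total, count = 0.0, 0
    for y_row, p_row, m_row in zip(labels, student_probs, mask):
        for y, p, m in zip(y_row, p_row, m_row):
            if m == 0:
                continue  # masked entry: contributes nothing to the loss
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / max(count, 1)


# Toy data: 4 clips, 2 classes (values are illustrative only).
labels = [[1, 0], [0, 0], [0, 1], [0, 0]]
teacher_scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
mask = build_loss_mask(labels, teacher_scores, discard_fraction=0.34)
student_loss = masked_bce(labels, teacher_scores, mask)
```

In this toy case, clip 1 for class 0 and clip 3 for class 1 are labeled negative but scored highly by the teacher, so they are excluded from the student loss. In a real pipeline the same mask would typically be applied as per-element weights in the framework's binary cross-entropy loss.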

Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

  • Yuan Gong, Jingbo Yu, James R. Glass
  • Computer Science, Physics
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
A VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects is created to support research on building robust and accurate vocal sound recognition.

Semi-Supervised Audio Classification with Partially Labeled Data

This paper presents two semi-supervised methods capable of learning with missing labels and evaluates them on two publicly available, partially labeled datasets.

FSD50K: An Open Dataset of Human-Labeled Sound Events

FSD50K is introduced, an open dataset containing over 51 k audio clips totalling over 100 h of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster SER research.

PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation

PSLA is presented, a collection of training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation, together with their design choices; it achieves a new state-of-the-art mean average precision on AudioSet.

PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation

PSLA is presented, a collection of model-agnostic training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation.

Symptom Identification for Interpretable Detection of Multiple Mental Disorders

Mental disease detection (MDD) from social media has suffered from poor generalizability and interpretability, due to lack of symptom modeling. This paper introduces PsySym, the first annotated

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

An intriguing interaction is found between two very different models: CNN and AST models are good teachers for each other. When either of them is used as the teacher and the other is trained as the student via knowledge distillation, the performance of the student model noticeably improves and, in many cases, is better than that of the teacher model.

Enriching Ontology with Temporal Commonsense for Low-Resource Audio Tagging

This work investigates robust audio tagging models in low-resource scenarios with the enhancement of knowledge graphs and proposes a semi-automatic approach that can construct temporal knowledge graphs on diverse domain-specific label sets.

Sound Event Detection: A tutorial

Imagine standing on a street corner in the city. With your eyes closed you can hear and recognize a succession of sounds: cars passing by, people speaking, their footsteps when they walk by, and the


This paper evaluates two pooling methods to improve shift invariance in CNNs, based on low-pass filtering and adaptive sampling of incoming feature maps, and shows that these modifications consistently improve sound event classification in all cases considered, without adding any (or adding very few) trainable parameters, which makes them an appealing alternative to conventional pooling layers.



MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grain classification, face attributes and large-scale geo-localization.

Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.

SeCoST: Sequential Co-Supervision for Weakly Labeled Audio Event Detection

Confident Learning: Estimating Uncertainty in Dataset Labels

This work combines the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels, and presents a generalized CL which is provably consistent and experimentally performant.

The Impact of Missing Labels and Overlapping Sound Events on Multi-label Multi-instance Learning for Sound Event Classification

This paper investigates two state-of-the-art methodologies that allow this type of learning, low-resolution multi-label non-negative matrix deconvolution (LRM-NMD) and CNN, and shows good robustness to missing labels.

Model-Agnostic Approaches To Handling Noisy Labels When Training Sound Event Classifiers

This work evaluates simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup and noise-robust loss functions, which can be easily incorporated into existing deep learning pipelines without the need for network modifications or extra resources.

A Deep Residual Network for Large-Scale Acoustic Scene Analysis

The task of training a multi-label event classifier directly from the audio recordings of AudioSet is studied and it is found that the models are able to localize audio events when a finer time resolution is needed.

Sound Event Detection Using Point-Labeled Data

  • B. Kim, B. Pardo
  • Computer Science
    2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2019
This work illustrates methods to train a SED model on point-labeled data and shows that a model trained on point-labeled audio data significantly outperforms weak models and is comparable to a model trained on strongly labeled data.

Audio tagging with noisy labels and minimal supervision

This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.