Semi-Supervised Audio Classification with Partially Labeled Data

Siddharth Gururani and Alexander Lerch
2021 IEEE International Symposium on Multimedia (ISM)
Audio classification has seen great progress with the increasing availability of large-scale datasets. These large datasets, however, are often only partially labeled, as collecting full annotations is a tedious and expensive process. This paper presents two semi-supervised methods capable of learning with missing labels and evaluates them on two publicly available, partially labeled datasets. The first method relies on label enhancement by a two-stage teacher-student learning process, while the…
1 Citation


Symptom Identification for Interpretable Detection of Multiple Mental Disorders
Mental disease detection (MDD) from social media has suffered from poor generalizability and interpretability, due to lack of symptom modeling. This paper introduces PsySym, the first annotated


Semi-Supervised Audio Classification with Consistency-Based Regularization
This paper incorporates audio-specific perturbations into the Mean Teacher algorithm and demonstrates the effectiveness of the resulting method on audio classification tasks.
DCASE 2019 Task 2: Multitask Learning, Semi-supervised Learning and Model Ensemble with Noisy Data for Audio Tagging
This paper describes an approach to the DCASE 2019 challenge Task 2 (audio tagging with noisy labels and minimal supervision), a multi-label audio classification task with 80 classes, and proposes three strategies, including multitask learning using noisy data and labels that are relabeled using trained models' predictions.
Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning
An unsupervised loss function is proposed that takes advantage of the stochastic nature of these methods and minimizes the difference between the predictions of multiple passes of a training sample through the network.
Semi-supervised learning using teacher-student models for vocal melody extraction
The results show that the SSL method significantly improves performance over supervised learning alone, and that the improvement depends on the teacher-student models, the size of the unlabeled data, the number of self-training iterations, and other training details.
Audio Set Classification with Attention Model: A Probabilistic Perspective
This paper investigates Audio Set classification. Audio Set is a large-scale weakly labelled dataset (WLD) of audio clips. In WLD only the presence of a label is known, without knowing the
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks, but it becomes unwieldy when learning large datasets, so Mean Teacher, a method that averages model weights instead of label predictions, is proposed.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
The Impact of Missing Labels and Overlapping Sound Events on Multi-label Multi-instance Learning for Sound Event Classification
This paper investigates two state-of-the-art methodologies that allow this type of learning, low-resolution multi-label non-negative matrix deconvolution (LRM-NMD) and CNN, and shows good robustness to missing labels.
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
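
The Mean Teacher entry above describes averaging model weights rather than label predictions. A minimal sketch of that idea follows, assuming weights stored as plain Python lists keyed by parameter name. The function names `ema_update` and `consistency_loss`, the default `alpha`, and the use of a mean-squared consistency term are illustrative assumptions, not code or settings taken from any of the listed papers.

```python
from typing import Dict, List

# Hypothetical weight container: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def ema_update(teacher: Weights, student: Weights, alpha: float = 0.99) -> Weights:
    """Return teacher weights moved toward the student by an exponential moving average.

    new_teacher = alpha * teacher + (1 - alpha) * student
    """
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def consistency_loss(student_probs: List[float], teacher_probs: List[float]) -> float:
    """Mean squared difference between student and teacher predictions.

    Mean squared error is one common choice of consistency term (assumed here).
    """
    return sum((s - t) ** 2
               for s, t in zip(student_probs, teacher_probs)) / len(student_probs)


# Usage: after each student optimization step, refresh the teacher.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
teacher = ema_update(teacher, student, alpha=0.9)
```

In a training loop of this kind, only the student would be updated by gradient descent; the teacher is never trained directly and only tracks the student through `ema_update`.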