DCASE 2019 Task 2: Multitask Learning, Semi-supervised Learning and Model Ensemble with Noisy Data for Audio Tagging

Osamu Akiyama and Junya Sato
This paper describes our approach to the DCASE 2019 challenge Task 2: Audio tagging with noisy labels and minimal supervision. The task is multi-label audio classification with 80 classes. The training data consists of a small amount of reliably labeled data (curated data) and a larger amount of data with unreliable labels (noisy data). In addition, the data distributions of the curated and noisy data differ. To tackle these difficulties, we propose three strategies…


Audio Tagging by Cross Filtering Noisy Labels
This article presents CrossFilter, a novel framework for combating the noisy-label problem in audio tagging; it achieves state-of-the-art performance and even surpasses ensemble models on the FSDKaggle2018 dataset.
Semi-Supervised Audio Classification with Partially Labeled Data
This paper presents two semi-supervised methods capable of learning with missing labels and evaluates them on two publicly available, partially labeled datasets.
Urban Sound Tagging using Convolutional Neural Networks
It is shown that using pre-trained image classification models together with data augmentation techniques yields higher performance than alternative approaches.
In this technical report, we address the UOS submission for the Detection and Classification of Acoustic Scenes and Events 2020 Challenge Task 1-a. We propose to utilize the representation vectors,
DCASENET: An Integrated Pretrained Deep Neural Network for Detecting and Classifying Acoustic Scenes and Events
The aim is to build an integrated system that can serve as a pretrained model for the three abovementioned tasks; the work demonstrates that the proposed architecture, called DcaseNet, can either be used directly for any of the tasks with suitable results or be fine-tuned to improve performance on one task.
DCASENET: A joint pre-trained deep neural network for detecting and classifying acoustic scenes and events
This study proposes an integrated deep neural network that can perform three tasks (acoustic scene classification, audio tagging, and sound event detection) and shows that the proposed system, DCASENet, can itself be used directly for any of the tasks with competitive results, or be further fine-tuned for the target task.
Multimodal Urban Sound Tagging with Spatiotemporal Context
A multimodal UST system that jointly mines the audio and spatiotemporal context is presented, and a data filtering approach is adopted in text processing to further improve multimodal performance.


Audio tagging with noisy labels and minimal supervision
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
Label-efficient audio classification through multitask learning and self-supervision
This work trains an end-to-end audio feature extractor based on WaveNet that feeds into simple, yet versatile task-specific neural networks and describes several easily implemented self-supervised learning tasks that can operate on any large, unlabeled audio corpus.
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks, but it becomes unwieldy when learning large datasets, so Mean Teacher, a method that averages model weights instead of label predictions, is proposed.
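The weight averaging mentioned in this summary can be illustrated as an exponential moving average of student weights. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `ema_update`, the flat weight lists, and the decay values are illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` are flat lists of floats (a hypothetical
    stand-in for real model parameters).
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy usage: the teacher drifts toward the student a little on every step.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
```

In a real training loop the student would change between updates; it is held fixed here only to keep the example small.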
Realistic Evaluation of Deep Semi-Supervised Learning Algorithms
This work creates a unified reimplementation and evaluation platform of various widely-used SSL techniques and finds that the performance of simple baselines which do not use unlabeled data is often underreported, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-class examples.
MixMatch: A Holistic Approach to Semi-Supervised Learning
This work unifies the current dominant approaches for semi-supervised learning to produce a new algorithm, MixMatch, that works by guessing low-entropy labels for data-augmented unlabeled examples and mixing labeled and unlabeled data using MixUp.
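Two ingredients named in this summary, low-entropy label guessing and MixUp, can be sketched as follows. This is a simplified illustration under stated assumptions, not the full MixMatch algorithm: the function names, the temperature 0.5, and the Beta parameter 0.75 are illustrative choices, and inputs are plain Python lists.

```python
import random


def sharpen(probs, temperature=0.5):
    """Lower the entropy of a probability vector by raising each entry to
    the power 1/temperature and renormalizing."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def mixup(x1, x2, alpha=0.75, rng=random):
    """Convexly combine two vectors with a Beta(alpha, alpha) coefficient.

    Taking max(lam, 1 - lam) keeps the result closer to x1 than to x2.
    """
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]


# Toy usage: sharpen a guessed label, then mix two examples.
guessed = sharpen([0.6, 0.4])
mixed = mixup([1.0, 0.0], [0.0, 1.0])
```

The same `mixup` call would be applied to both inputs and their (guessed or true) labels with a shared coefficient; that bookkeeping is omitted here.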
Distilling the Knowledge in a Neural Network
This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
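The idea of distilling an ensemble into a single model relies on soft targets, which can be sketched with a temperature-scaled softmax. The code below is a minimal illustration in plain Python; the function names, the arithmetic averaging of per-model distributions, and the temperature values are assumptions made for the example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; higher temperatures give softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(logit_sets, temperature=2.0):
    """Average the softened distributions of several models (assumed scheme)."""
    per_model = [softmax_with_temperature(l, temperature) for l in logit_sets]
    count = len(per_model)
    return [sum(column) / count for column in zip(*per_model)]


# Toy usage: two hypothetical ensemble members scoring three classes.
targets = ensemble_soft_targets([[2.0, 1.0, 0.1], [1.5, 1.2, 0.3]])
```

A student model would then be trained to match `targets`, typically alongside the hard labels.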
Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks
This work proposes a simple and efficient semi-supervised learning method for deep neural networks: the network is trained in a supervised fashion with labeled and unlabeled data simultaneously, which favors a low-density separation between classes.
Learning from Between-class Examples for Deep Sound Recognition
The experimental results show that BC learning improves the performance on various sound recognition networks, datasets, and data augmentation schemes, in which BC learning proves to be always beneficial.
Dropout: a simple way to prevent neural networks from overfitting
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Snapshot Ensembles: Train 1, get M for free
This paper proposes a method to achieve the seemingly contradictory goal of ensembling multiple neural networks at no additional training cost by training a single neural network, letting it converge to several local minima along its optimization path, and saving the model parameters at each.
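Visiting several local minima in one training run is commonly driven by a cyclic learning-rate schedule that restarts after each snapshot. The sketch below assumes a cosine-annealed schedule with restarts and 0-indexed iterations; the function names and the initial rate 0.1 are illustrative assumptions.

```python
import math


def cyclic_cosine_lr(iteration, total_iterations, num_cycles, initial_lr=0.1):
    """Cosine-annealed learning rate that restarts at the start of each cycle."""
    cycle_len = math.ceil(total_iterations / num_cycles)
    position = iteration % cycle_len
    return 0.5 * initial_lr * (math.cos(math.pi * position / cycle_len) + 1.0)


def snapshot_iterations(total_iterations, num_cycles):
    """Last iteration of each cycle, where the rate is lowest and a snapshot is saved."""
    cycle_len = math.ceil(total_iterations / num_cycles)
    return [min((k + 1) * cycle_len, total_iterations) - 1 for k in range(num_cycles)]


# Toy usage: 300 iterations split into 3 cycles.
schedule = [cyclic_cosine_lr(i, 300, 3) for i in range(300)]
save_at = snapshot_iterations(300, 3)
```

At test time the saved snapshots would be ensembled, for example by averaging their predictions.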