Corpus ID: 51723509

General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline

Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons and Xavier Serra
This paper describes Task 2 of the DCASE 2018 Challenge, titled "General-purpose audio tagging of Freesound content with AudioSet labels". This task was hosted on the Kaggle platform as "Freesound General-Purpose Audio Tagging Challenge". The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology. We present the task, the dataset prepared for the competition, and a baseline… 
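On Kaggle, submissions listed up to three labels per clip and were scored with mean average precision at 3 (MAP@3). A minimal sketch of that metric, assuming one ground-truth label per clip (function names here are illustrative, not from the challenge toolkit):

```python
def apk3(actual, predicted):
    """Average precision at 3 for a single clip with one true label.

    With a single ground-truth label, AP@3 reduces to 1/(rank of the
    correct label) if it appears in the top 3, else 0.
    """
    for rank, label in enumerate(predicted[:3], start=1):
        if label == actual:
            return 1.0 / rank
    return 0.0

def map3(actuals, predictions):
    """Mean average precision at 3 over a set of clips."""
    return sum(apk3(a, p) for a, p in zip(actuals, predictions)) / len(actuals)

# Example: three clips, each with one true label and a top-3 guess list.
truth = ["Bark", "Flute", "Applause"]
preds = [["Bark", "Meow", "Flute"],      # correct at rank 1 -> 1.0
         ["Cello", "Flute", "Oboe"],     # correct at rank 2 -> 0.5
         ["Cough", "Laughter", "Bus"]]   # missed            -> 0.0
print(map3(truth, preds))  # 0.5
```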

Citations

Audio tagging with noisy labels and minimal supervision
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
DCASE 2018 task 2: iterative training, label smoothing, and background noise normalization for audio event tagging
This paper describes an approach from the submissions for DCASE 2018 Task 2, general-purpose audio tagging of Freesound content with AudioSet labels, and proposes pseudo-labels for automatic label verification and label smoothing to reduce over-fitting.
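Label smoothing itself is a one-line transform of the training targets; a minimal sketch, assuming one-hot NumPy targets (the `eps` value is illustrative):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: soften one-hot targets so the model is less
    inclined to over-fit confident but possibly wrong (noisy) labels."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

y = np.eye(3)[[0, 2]]                    # two one-hot targets over 3 classes
print(smooth_labels(y).sum(axis=1))      # rows still sum to 1
```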
This technical report proposes the model architectures which can efficiently tag the audio with multi-label and noisy label based on convolutional network and recurrent network to unify detection of audio events.
The Aalto system based on fine-tuned AudioSet features for DCASE2018 task2 - general purpose audio tagging
A neural network system for DCASE 2018 Task 2, general-purpose audio tagging, is presented, which outperforms the baseline result of 0.704 and achieves top 8% on the public leaderboard.
General-purpose audio tagging by ensembling convolutional neural networks based on multiple features
This paper describes an audio tagging system that participated in Task 2, “General-purpose audio tagging of Freesound content with AudioSet labels”, of the “Detection and Classification of Acoustic Scenes and Events” (DCASE) 2018 Challenge.
General-purpose audio tagging from noisy labels using convolutional neural networks
A system using an ensemble of convolutional neural networks trained on log-scaled mel spectrograms to address general-purpose audio tagging challenges and to reduce the effects of label noise is proposed.
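As a rough illustration of the input feature these systems share, here is a from-scratch log-scaled mel spectrogram in NumPy (HTK-style mel scale; real submissions typically use librosa and tune `n_fft`, `hop`, and `n_mels`):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=44100, n_fft=1024, hop=512, n_mels=64):
    """Log-scaled mel spectrogram, computed from scratch with NumPy."""
    # Frame the signal with a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank between 0 Hz and Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # Log-compress (small epsilon avoids log(0)).
    return np.log(power @ fbank.T + 1e-10)

# One second of a 440 Hz tone -> (frames, n_mels) feature matrix.
sr = 22050
t = np.arange(sr) / sr
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr=sr)
print(feats.shape)  # (42, 64)
```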
Meta learning based audio tagging
This paper describes a solution for the general-purpose audio tagging task, one of the subtasks in the DCASE 2018 Challenge, and proposes a meta-learning-based ensemble method that provides higher prediction accuracy and robustness than a single model.
Weakly Labelled AudioSet Tagging With Attention Neural Networks
This work bridges the connection between attention neural networks and multiple instance learning (MIL) methods, and proposes decision-level and feature-level attention neural networks for audio tagging, achieving state-of-the-art mean average precision.
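Decision-level attention pooling of this kind reduces to a softmax-weighted average of per-segment predictions over time; a minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decision_level_attention(seg_probs, att_logits):
    """Aggregate per-segment class probabilities into a clip-level
    prediction using softmax attention weights over time (MIL pooling)."""
    w = softmax(att_logits, axis=0)      # (T, C) weights, sum to 1 over T
    return (w * seg_probs).sum(axis=0)   # (C,) clip-level probabilities

T, C = 10, 41                            # 10 segments, 41 classes
rng = np.random.default_rng(1)
clip = decision_level_attention(rng.random((T, C)),
                                rng.standard_normal((T, C)))
print(clip.shape)  # (41,)
```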
FSD50K: An Open Dataset of Human-Labeled Sound Events
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 hours of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
General audio tagging with ensembling convolutional neural network and statistical features
An ensemble learning framework is applied to combine statistical features with the outputs of deep classifiers, with the goal of utilizing complementary information to address the noisy-label problem.


References

Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset.
The use of convolutional neural networks (CNN) to label the audio signals recorded in a domestic (home) environment is investigated and a relative 23.8% improvement over the Gaussian mixture model (GMM) baseline method is observed over the development dataset for the challenge.
Freesound Datasets: A Platform for the Creation of Open Audio Datasets
Paper presented at the 18th International Society for Music Information Retrieval Conference, held in Suzhou, China, 23–27 October 2017.
Chime-home: A dataset for sound source recognition in a domestic environment
The annotation approach associates each 4-second excerpt from the audio recordings with multiple labels drawn from a set of 7 labels for sound sources in the acoustic environment, to obtain a representation of “ground truth” in the annotations.
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning is proposed to handle the multi-label classification task, together with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-filter bank features.
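A toy version of the symmetric (tied-weight) denoising auto-encoder idea can be sketched in a few lines of NumPy; this is a single-hidden-layer illustration on random data, not the paper's actual shrinking DNN:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for log-mel feature vectors (100 examples, 64 bands).
X = rng.standard_normal((100, 64))

# Symmetric (tied-weight) denoising auto-encoder: one hidden layer,
# the decoder weights are the transpose of the encoder weights.
W = rng.standard_normal((64, 32)) * 0.1
b_h = np.zeros(32)
b_o = np.zeros(64)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.01
losses = []
for step in range(200):
    noisy = X + 0.1 * rng.standard_normal(X.shape)  # corrupt the input
    h = sigmoid(noisy @ W + b_h)                    # encode
    recon = h @ W.T + b_o                           # tied-weight decode
    err = recon - X                                 # reconstruct clean input
    losses.append((err ** 2).mean())
    # Backprop through the tied weights (encoder and decoder share W).
    dW_dec = h.T @ err
    dh = (err @ W) * h * (1 - h)
    dW_enc = noisy.T @ dh
    W -= lr * (dW_dec.T + dW_enc) / len(X)
    b_o -= lr * err.mean(axis=0)
    b_h -= lr * dh.mean(axis=0)

# The hidden activations serve as new data-driven features.
features = sigmoid(X @ W + b_h)
print(features.shape)  # (100, 32)
```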
Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge
The emergence of deep learning as the most popular classification method is observed, replacing the traditional approaches based on Gaussian mixture models and support vector machines.
Freesound technical demo
This demo introduces Freesound to the multimedia community and shows its potential as a research resource.
SoundNet: Learning Sound Representations from Unlabeled Video
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on the audio classification task, and that larger training and label sets help up to a point.