FSD50K: An Open Dataset of Human-Labeled Sound Events

@article{Fonseca2022FSD50KAO,
  title={FSD50K: An Open Dataset of Human-Labeled Sound Events},
  author={Eduardo Fonseca and Xavier Favory and Jordi Pons and Frederic Font and Xavier Serra},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2022},
  volume={30},
  pages={829-852}
}
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage rights issues. To provide an alternative benchmark dataset…
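Since FSD50K ships as openly licensed audio with CSV ground truth, a minimal loading sketch may be useful; it assumes the layout of the public release (mono 44.1 kHz WAV clips under FSD50K.dev_audio/ and a FSD50K.ground_truth/dev.csv with fname, labels, mids and split columns) and should be checked against the actual download.

# Minimal sketch: read the FSD50K dev ground truth and one audio clip.
# Assumes the public release layout described above; verify paths and
# column names against your copy of the dataset.
import pandas as pd
import soundfile as sf

dev = pd.read_csv("FSD50K.ground_truth/dev.csv")
dev["labels"] = dev["labels"].str.split(",")          # clips are multi-label

train = dev[dev["split"] == "train"]
val = dev[dev["split"] == "val"]

fname = train.iloc[0]["fname"]
audio, sr = sf.read(f"FSD50K.dev_audio/{fname}.wav")  # mono, 44.1 kHz
print(len(train), len(val), sr, train.iloc[0]["labels"])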
ARCA23K: An audio dataset for investigating open-set label noise
TLDR
It is shown that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and this type of label noise is referred to as open-set label noise.
GISE-51: A scalable isolated sound events dataset
TLDR
This work introduces GISE-51, a dataset spanning 51 isolated sound events belonging to a broad domain of event types, providing an open, reproducible benchmark for future research along with the freedom to adapt the included isolated sound events for domain-specific applications.
Who Calls The Shots? Rethinking Few-Shot Learning for Audio
TLDR
A series of experiments leads to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, or support-set selection criterion; the best choice depends on the expected application scenario.
Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition
TLDR
A VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects is created to support research on building robust and accurate vocal sound recognition.
CLAP: Learning Audio Concepts From Natural Language Supervision
TLDR
Contrastive Language-Audio Pretraining (CLAP) learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space, and generalizes to multiple downstream tasks.
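As a rough illustration of the dual-encoder idea, the sketch below computes a symmetric audio-text contrastive loss over a batch of paired embeddings; the embedding size, batch size and temperature are placeholders, not the paper's actual configuration.

# Sketch of a symmetric (audio <-> text) contrastive loss over paired
# embeddings, in the spirit of CLAP-style dual-encoder training.
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    # L2-normalise both modalities so dot products are cosine similarities.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                    # (batch, batch)
    targets = torch.arange(a.size(0), device=a.device)
    # Matching pairs lie on the diagonal; contrast in both directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.T, targets)
    return (loss_a2t + loss_t2a) / 2

audio_emb = torch.randn(8, 512)   # placeholder audio-encoder outputs
text_emb = torch.randn(8, 512)    # placeholder text-encoder outputs
print(contrastive_audio_text_loss(audio_emb, text_emb).item())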
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks
TLDR
This paper formalizes this task as the cocktail fork problem, and presents the Divide and Remaster dataset to foster research on this topic, and introduces a new mixed-STFT-resolution model to better address the variety of acoustic characteristics of the three source types.
MetaAudio: A Few-Shot Audio Classification Benchmark
TLDR
This work carries out in-depth analyses of joint training and cross-dataset adaptation protocols, establishing the possibility of a generalised audio few-shot classification algorithm, and shows that gradient-based meta-learning methods such as MAML and Meta-Curvature consistently outperform both metric-based and baseline methods.
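For readers unfamiliar with gradient-based meta-learning, the sketch below shows a first-order MAML step on a batch of few-shot tasks; the model, task sampler and learning rates are placeholders, and the benchmarked methods also include second-order variants.

# Sketch of first-order MAML: adapt a copy of the model on each task's
# support set, then update the shared initialisation from query-set losses.
import copy
import torch
import torch.nn.functional as F

def fomaml_step(model, meta_optimizer, tasks, inner_lr=0.01, inner_steps=1):
    meta_optimizer.zero_grad()
    for support_x, support_y, query_x, query_y in tasks:
        learner = copy.deepcopy(model)                     # task-specific copy
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                       # adapt on the support set
            inner_opt.zero_grad()
            F.cross_entropy(learner(support_x), support_y).backward()
            inner_opt.step()
        inner_opt.zero_grad()                              # drop stale support grads
        query_loss = F.cross_entropy(learner(query_x), query_y) / len(tasks)
        query_loss.backward()                              # grads live on the copy
        # First-order approximation: accumulate the adapted copy's gradients
        # onto the shared initialisation.
        for p, q in zip(model.parameters(), learner.parameters()):
            p.grad = q.grad.clone() if p.grad is None else p.grad + q.grad
    meta_optimizer.step()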
Pseudo strong labels for large scale weakly supervised audio tagging
TLDR
This work proposes pseudo strong labels (PSL), a simple label augmentation framework that enhances the supervision quality for large-scale weakly supervised audio tagging and reveals that PSL mitigates missing labels.
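A minimal sketch of the pseudo-strong-label idea follows: a pretrained teacher tags fixed-length segments of a weakly labelled clip, and its segment-level predictions supervise the student; the segment length, the teacher interface and the thresholding are assumptions, not the paper's exact recipe.

# Sketch: derive segment-level pseudo labels from a clip-level weak label
# by running a teacher model over fixed-length segments of the waveform.
import torch

def pseudo_strong_labels(teacher, clip, segment_len, threshold=0.5):
    n = clip.size(-1) // segment_len                  # number of full segments
    segments = clip[: n * segment_len].reshape(n, segment_len)
    with torch.no_grad():
        probs = torch.sigmoid(teacher(segments))      # (n_segments, n_classes)
    # Use soft probabilities directly, or binarise them as done here.
    return (probs > threshold).float()

# The student is then trained on the same segments against these
# per-segment targets instead of the single clip-level label.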
DASEE A Synthetic Database of Domestic Acoustic Scenes and Events in Dementia Patients Environment
TLDR
This work details an approach to generating an unbiased synthetic domestic audio database consisting of sound scenes and events emulated in both quiet and noisy environments, and presents an 11-class database containing excerpts of clean and noisy signals.
Wav2CLIP: Learning Robust Audio Representations From CLIP
TLDR
Wav2CLIP, a robust audio representation learning method that distills from Contrastive Language-Image Pre-training (CLIP), is proposed; it is more efficient to pretrain than competing methods as it does not require learning a visual model in concert with an auditory model.
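The sketch below illustrates the distillation setup with a contrastive training step, mirroring the dual-encoder loss above but with the visual side replaced by frozen, precomputed CLIP frame embeddings; the audio encoder, optimiser and data pipeline are placeholders.

# Sketch of CLIP-distilled audio representation learning: only the audio
# encoder is trained, against frozen CLIP embeddings of the paired frames.
import torch
import torch.nn.functional as F

def distill_step(audio_encoder, optimizer, waveforms, frozen_clip_emb, temperature=0.07):
    audio_emb = F.normalize(audio_encoder(waveforms), dim=-1)
    image_emb = F.normalize(frozen_clip_emb, dim=-1)   # precomputed, no gradients
    logits = audio_emb @ image_emb.T / temperature     # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss = F.cross_entropy(logits, targets)            # match audio to its own frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()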
...

References

Showing 1-10 of 136 references
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Mt-Gcn For Multi-Label Audio Tagging With Noisy Labels
TLDR
MT-GCN is presented, a Multi-task Learning based Graph Convolutional Network that learns domain knowledge from an ontology and outperforms the baseline methods by a significant margin.
Learning Sound Event Classifiers from Web Audio with Noisy Labels
TLDR
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in the presence of corrupted labels.
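One common family of noise-robust losses is bootstrapping, where the target is a convex mix of the possibly corrupted label and the model's own prediction; the sketch below shows a soft-bootstrapping variant of binary cross-entropy for multi-label tagging, with an illustrative mixing weight (not necessarily one of the losses evaluated in the paper).

# Sketch of soft bootstrapping for multi-label tagging with noisy labels:
# blend the noisy target with the model's own (detached) prediction.
import torch
import torch.nn.functional as F

def soft_bootstrap_bce(logits, noisy_targets, beta=0.9):
    probs = torch.sigmoid(logits)
    mixed_targets = beta * noisy_targets + (1.0 - beta) * probs.detach()
    return F.binary_cross_entropy_with_logits(logits, mixed_targets)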
Vggsound: A Large-Scale Audio-Visual Dataset
TLDR
This work collects a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques, and investigates various Convolutional Neural Network architectures and aggregation approaches to establish audio recognition baselines for the new dataset.
A Closer Look at Weak Label Learning for Audio Events
TLDR
This work describes a CNN-based approach for weakly supervised training of audio events, describes important characteristics which naturally arise in weakly supervised learning of sound events, and shows how these aspects of weak labels affect the generalization of models.
Audio tagging with noisy labels and minimal supervision
TLDR
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
Chime-home: A dataset for sound source recognition in a domestic environment
TLDR
The annotation approach associates each 4-second excerpt from the audio recordings with multiple labels, based on a set of 7 labels associated with sound sources in the acoustic environment, to obtain a representation of 'ground truth' in annotations.
The Benefit of Temporally-Strong Labels in Audio Event Classification
  Shawn Hershey, D. Ellis, M. Plakal · ICASSP 2021
TLDR
It is shown that fine-tuning with a mix of weak- and strongly-labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels.
Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking
TLDR
This work proposes a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the learning process, finding that a simple optimisation of the training label set improves recognition performance without additional computation.
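As a rough sketch of the loss-masking idea, the snippet below zeroes the binary cross-entropy terms for classes a teacher flags as likely missing (annotated negative but scored high); the teacher scores, the threshold and the multi-hot label format are assumptions.

# Sketch of loss masking for missing labels: ignore per-class loss terms
# where the teacher suggests an annotated negative is actually present.
import torch
import torch.nn.functional as F

def masked_bce(logits, labels, teacher_probs, threshold=0.5):
    # labels: multi-hot float tensor of shape (batch, n_classes)
    missing_candidates = (labels == 0) & (teacher_probs > threshold)
    weights = (~missing_candidates).float()
    per_term = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    return (per_term * weights).sum() / weights.sum().clamp(min=1.0)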
Large-scale audio event discovery in one million YouTube videos
TLDR
This work performs an unprecedented exploration into the large-scale discovery of recurring audio events in a diverse corpus of one million YouTube videos, applying a streaming, nonparametric clustering algorithm to both spectral features and out-of-domain neural audio embeddings.
...