FSD50K: An Open Dataset of Human-Labeled Sound Events

@article{Fonseca2022FSD50KAO,
  title={FSD50K: An Open Dataset of Human-Labeled Sound Events},
  author={Eduardo Fonseca and Xavier Favory and Jordi Pons and Frederic Font and Xavier Serra},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2022},
  volume={30},
  pages={829--852}
}
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, which is based on over 2M tracks from YouTube videos and encompasses over 500 sound classes. However, AudioSet is not an open dataset, as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage-rights issues. To provide an alternative benchmark dataset…
ARCA23K: An audio dataset for investigating open-set label noise
TLDR
It is shown that the majority of labelling errors in ARCA23K are due to out-of-vocabulary audio clips, and this type of label noise is referred to as open-set label noise.
GISE-51: A scalable isolated sound events dataset
TLDR
This work introduces GISE-51, a dataset spanning 51 isolated sound events belonging to a broad domain of event types, providing an open, reproducible benchmark for future research along with the freedom to adapt the included isolated sound events for domain-specific applications.
Who Calls The Shots? Rethinking Few-Shot Learning for Audio
TLDR
A series of experiments leads to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, or support-set selection criterion, and the right choice depends on the expected application scenario.
The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks
TLDR
This paper formalizes this task as the cocktail fork problem, and presents the Divide and Remaster dataset to foster research on this topic, and introduces a new mixed-STFT-resolution model to better address the variety of acoustic characteristics of the three source types.
DASEE A Synthetic Database of Domestic Acoustic Scenes and Events in Dementia Patients Environment
TLDR
This work details its approach on generating an unbiased synthetic domestic audio database, consisting of sound scenes and events, emulated in both quiet and noisy environments, and presents an 11-class database containing excerpts of clean and noisy signals.
Wav2CLIP: Learning Robust Audio Representations From CLIP
TLDR
Wav2CLIP is proposed, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP), and is more efficient to pretrain than competing methods as it does not require learning a visual model in concert with an auditory model.
Improving Deep-learning-based Semi-supervised Audio Tagging with Mixup
TLDR
This article adapted four recent SSL methods to the task of audio tagging and explored the benefits of using the mixup augmentation in the four algorithms, finding that in almost all cases, mixup brought significant gains.
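For context, the mixup augmentation referenced in this entry blends pairs of training examples and their label vectors with a Beta-distributed coefficient. A minimal NumPy sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Blend two examples and their label vectors with lambda ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Two toy "clips" (feature vectors) with one-hot tag vectors.
xa, ya = np.ones(4), np.array([1.0, 0.0])
xb, yb = np.zeros(4), np.array([0.0, 1.0])
xm, ym = mixup(xa, ya, xb, yb)
```

The mixed label `ym` stays a valid probability vector (its entries sum to 1), which is what makes mixup compatible with multi-label audio tagging losses.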
Training Sound Event Detection on a Heterogeneous Dataset
TLDR
This work proposes to perform a detailed analysis of DCASE 2020 task 4 sound event detection baseline with regards to several aspects such as the type of data used for training, the parameters of the mean-teacher or the transformations applied while generating the synthetic soundscapes.
Learning Audio Representations with MLPs
In this paper, we propose an efficient MLP-based approach for learning audio representations, namely timestamp and scene-level audio embeddings. We use an encoder consisting of sequentially stacked…
What’s all the Fuss about Free Universal Sound Separation Data?
  • Scott Wisdom, Hakan Erdogan, J. Hershey
  • Computer Science, Physics
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced, based on an improved time-domain convolutional network (TDCN++), that achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.
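The SI-SNRi figure reported in this entry is measured with the standard scale-invariant signal-to-noise ratio; a minimal NumPy sketch of that metric (the implementation details here are a common formulation, not taken from the paper):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB: project the estimate onto the reference,
    then compare the projection's energy to the residual's energy."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

# Scaling the estimate leaves the score unchanged (hence "scale-invariant").
t = np.linspace(0.0, 1.0, 1000)
s = np.sin(2.0 * np.pi * 5.0 * t)
clean_score = si_snr(2.0 * s, s)   # very high: estimate is a scaled copy
noisy_score = si_snr(s + 0.1 * np.cos(2.0 * np.pi * 50.0 * t), s)
```

SI-SNRi ("improvement") is then the SI-SNR of the separated output minus the SI-SNR of the unprocessed mixture.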

References

Showing 1-10 of 136 references
Mt-Gcn For Multi-Label Audio Tagging With Noisy Labels
TLDR
MT-GCN is presented, a Multi-task Learning based Graph Convolutional Network that learns domain knowledge from ontology that outperforms the baseline methods by a significant margin.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Learning Sound Event Classifiers from Web Audio with Noisy Labels
TLDR
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.
Vggsound: A Large-Scale Audio-Visual Dataset
TLDR
The goal is to collect a large-scale audio-visual dataset with low label noise from videos ‘in the wild’ using computer vision techniques and investigates various Convolutional Neural Network architectures and aggregation approaches to establish audio recognition baselines for this new dataset.
A Closer Look at Weak Label Learning for Audio Events
TLDR
This work describes a CNN-based approach for weakly supervised training of audio events, describes important characteristics which naturally arise in weakly supervised learning of sound events, and shows how these aspects of weak labels affect the generalization of models.
Audio tagging with noisy labels and minimal supervision
TLDR
This paper presents the task setup, the FSDKaggle2019 dataset prepared for this scientific evaluation, and a baseline system consisting of a convolutional neural network.
Chime-home: A dataset for sound source recognition in a domestic environment
TLDR
The annotation approach associates each 4-second excerpt from the audio recordings with multiple labels, based on a set of 7 labels associated with sound sources in the acoustic environment, to obtain a representation of 'ground truth' in annotations.
The Benefit of Temporally-Strong Labels in Audio Event Classification
  • Shawn Hershey, D. Ellis, M. Plakal
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
It is shown that fine-tuning with a mix of weak- and strongly-labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels.
Large-scale audio event discovery in one million YouTube videos
TLDR
This work performs an unprecedented exploration into the large-scale discovery of recurring audio events in a diverse corpus of one million YouTube videos to apply a streaming, nonparametric clustering algorithm to both spectral features and out-of-domain neural audio embeddings.
Sound Event Detection Using Point-Labeled Data
  • B. Kim, B. Pardo
  • Computer Science
    2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2019
TLDR
This work illustrates methods to train a SED model on point-labeled data and shows that a model trained on point-labeled audio data significantly outperforms weak models and is comparable to a model trained on strongly labeled data.