Audio Set: An ontology and human-labeled dataset for audio events

@article{Gemmeke2017AudioSA,
  title={Audio Set: An ontology and human-labeled dataset for audio events},
  author={Jort F. Gemmeke and Daniel P. W. Ellis and Dylan Freedman and Aren Jansen and Wade Lawrence and R. Channing Moore and Manoj Plakal and Marvin Ritter},
  journal={2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2017},
  pages={776-780}
}
  • J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, M. Ritter
  • Published 5 March 2017
  • Computer Science
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets, principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
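
The ontology itself is released as a machine-readable JSON file (github.com/audioset/ontology). A minimal sketch of loading and walking the 632-class hierarchy, assuming each record carries the "id", "name", and "child_ids" fields of that release:

```python
import json

# Load the AudioSet ontology release (github.com/audioset/ontology);
# each entry is assumed to carry at least "id", "name", and "child_ids".
with open("ontology.json") as f:
    nodes = {node["id"]: node for node in json.load(f)}

# Root classes are those that no other class lists as a child.
child_ids = {c for node in nodes.values() for c in node["child_ids"]}
roots = [i for i in nodes if i not in child_ids]

def print_tree(node_id: str, depth: int = 0) -> None:
    """Print the class hierarchy, one indented line per class."""
    print("  " * depth + nodes[node_id]["name"])
    for child in nodes[node_id]["child_ids"]:
        print_tree(child, depth + 1)

for root in roots:
    print_tree(root)
```

If a class is reachable from more than one parent, this sketch simply prints it once per path; a production traversal would deduplicate visited nodes.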

Citations

FSD50K: An Open Dataset of Human-Labeled Sound Events
TLDR
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 hours of audio, manually labeled using 200 classes drawn from the AudioSet ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
MT-GCN for Multi-Label Audio Tagging with Noisy Labels
TLDR
MT-GCN is presented, a multi-task-learning-based graph convolutional network that learns domain knowledge from an ontology and outperforms the baseline methods by a significant margin.
Audio Caption: Listen and Tell
  • Mengyue Wu, Heinrich Dinkel, Kai Yu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
A manually-annotated dataset for audio captioning is introduced to automatically generate natural sentences describing an audio scene and to bridge the gap between machine perception of audio and of images.
Improved Representation Learning For Acoustic Event Classification Using Tree-Structured Ontology
TLDR
It is shown that by organizing audio representations with a human-curated tree ontology, this framework can improve the quality of the learned audio representations for downstream AEC tasks and achieve performance comparable to fully supervised baselines on discriminative tasks.
Learning Sound Event Classifiers from Web Audio with Noisy Labels
TLDR
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in the presence of corrupted labels.
Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events
TLDR
An Audio-Grounding dataset is contributed, which provides the correspondence between sound events and the captions provided in AudioCaps, along with the location (timestamps) of each sound event present.
Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection
TLDR
This work addresses sound event detection in the case where only a weakly annotated dataset is available for training, exploring an approach inspired by multiple instance learning in which a convolutional recurrent neural network is trained to produce frame-level predictions using a custom loss function based on the weak labels and the statistics of the frame-level predictions.
FMA: A Dataset for Music Analysis
TLDR
The Free Music Archive is introduced, an open and easily accessible dataset suitable for evaluating several tasks in music information retrieval (MIR), a field concerned with browsing, searching, and organizing large music collections, and some suitable MIR tasks are discussed.
Large-Scale Weakly Supervised Sound Event Detection (DCASE Challenge 2017)
TLDR
This work uses a subset of Google’s AudioSet, a large collection of weakly labeled YouTube video excerpts, and employs multiple instance learning (MIL) to handle the weak labels by grouping clips into positive and negative bags.
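
Both this entry and the cosine-similarity entry above rest on the same multiple-instance idea: the network emits frame-level probabilities, which are pooled into a single clip-level score that is trained against the weak, clip-level labels. Neither paper's exact loss is reproduced here; the following is a generic max-pooling MIL sketch in PyTorch (the framework choice, the 240-frame resolution, and the 527-class label set are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def mil_clip_loss(frame_logits: torch.Tensor, weak_labels: torch.Tensor) -> torch.Tensor:
    """Multiple-instance loss for weakly labeled sound event detection.

    frame_logits: (batch, frames, classes) raw scores, e.g. from a CRNN.
    weak_labels:  (batch, classes) 0/1 clip-level labels.
    """
    frame_probs = torch.sigmoid(frame_logits)
    # Max pooling over time: a clip counts as positive for a class
    # as soon as any single frame does.
    clip_probs = frame_probs.amax(dim=1)
    return F.binary_cross_entropy(clip_probs, weak_labels)

# Toy batch: 4 ten-second clips, 240 frames, 527 classes (AudioSet-sized).
frame_logits = torch.randn(4, 240, 527, requires_grad=True)
weak_labels = torch.randint(0, 2, (4, 527)).float()
loss = mil_clip_loss(frame_logits, weak_labels)
loss.backward()
```

Max pooling is only one choice: softer pooling functions (mean, linear softmax, attention) relax the single-frame assumption and spread gradient across more frames.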
Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing
  • Yu Wu, Yi Yang
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
TLDR
Cross-modal audio-visual contrastive learning is proposed to induce temporal differences in attention models within videos, i.e., urging the model to pick the current temporal segment out of all context candidates by a large margin.

References

Showing 1-10 of 22 references
AudioSentibank: Large-scale Semantic Ontology of Acoustic Concepts for Audio Content Analysis
TLDR
This work introduces the AudioSentiBank corpus, a large-scale folksology containing over 1,123 adjective-noun and verb-noun pairs, and explores for the first time the classification performance of acoustic concept pairs.
Noisemes: Manual Annotation of Environmental Noise in Audio Streams
TLDR
Forty-two distinct labels, the “noisemes”, developed for the manual annotation of noise segments as they occur in audio streams of consumer-captured and semi-professionally produced videos, are introduced.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.
TUT database for acoustic scene classification and sound event detection
TLDR
The recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and an event detection baseline system using mel-frequency cepstral coefficients and Gaussian mixture models are presented.
A Dataset and Taxonomy for Urban Sound Research
TLDR
A taxonomy of urban sounds and a new dataset, UrbanSound, containing 27 hours of audio with 18.5 hours of annotated sound event occurrences across 10 sound classes are presented.
ImageNet Large Scale Visual Recognition Challenge
TLDR
The creation of this benchmark dataset and the advances in object recognition that have been possible as a result are described, and state-of-the-art computer vision accuracy is compared with human accuracy.
Automatic Acquisition of Hyponyms from Large Text Corpora
TLDR
A set of lexico-syntactic patterns is identified that are easily recognizable, occur frequently and across text genre boundaries, and indisputably indicate the lexical relation of interest.
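
The patterns in question are the Hearst patterns; the canonical instance is "X such as A, B, and C", which marks A, B, and C as hyponyms of X. A toy, string-level sketch of that one pattern (an illustrative simplification: real systems match part-of-speech-tagged noun phrases rather than raw text):

```python
import re

# One Hearst (1992) pattern: "X such as A, B, and C" implies that
# A, B, and C are hyponyms (kinds) of X.
PATTERN = re.compile(r"(\w[\w ]*?)\s+such as\s+([\w ,]+?)(?=[.;]|$)")

def hearst_hyponyms(text: str) -> list[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs matched by the toy pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1).strip()
        for term in re.split(r",|\band\b|\bor\b", match.group(2)):
            if term.strip():
                pairs.append((term.strip(), hypernym))
    return pairs

print(hearst_hyponyms("sounds such as barking, sirens, and music."))
# [('barking', 'sounds'), ('sirens', 'sounds'), ('music', 'sounds')]
```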
Sound Ontology for Computational Auditory Scene Analysis
This paper proposes that sound ontology should be used both as a common vocabulary for sound representation and as a common terminology for integrating various sound stream segregation systems. …
CLEAR Evaluation of Acoustic Event Detection and Classification Systems
TLDR
In this paper, the various systems for the tasks of acoustic event detection (AED) and acoustic event classification (AEC) and their results are presented.
Content-Based Classification, Search, and Retrieval of Audio
TLDR
The audio analysis, search, and classification engine described here reduces sounds to perceptual and acoustical features, which lets users search or retrieve sounds by any one feature or a combination of them, or by specifying previously learned classes based on these features.