Audio Set: An ontology and human-labeled dataset for audio events

  • Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, Marvin Ritter
  • Published 5 March 2017
  • Computer Science
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10-second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. …
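The abstract describes the dataset's two released artifacts: a hierarchical ontology of audio classes and CSVs of labeled 10-second YouTube segments. Below is a minimal sketch of reading both, assuming the general shape of the released `ontology.json` (entries with `id`, `name`, `child_ids`) and segment CSVs (`YTID, start_seconds, end_seconds, positive_labels` after comment headers); the class ids and clip here are invented for illustration.

```python
import csv
import io
import json

# Illustrative samples in the shape of the released files.
# The ids and the YouTube clip below are made up for illustration.
ONTOLOGY_JSON = """[
  {"id": "/m/0dgw9r", "name": "Human sounds", "child_ids": ["/m/09l8g"]},
  {"id": "/m/09l8g", "name": "Human voice", "child_ids": []}
]"""

SEGMENTS_CSV = """# Segments csv
# YTID, start_seconds, end_seconds, positive_labels
abc123xyz00, 30.000, 40.000, "/m/09l8g"
"""

def load_ontology(text):
    """Map class id -> ontology node for quick lookup."""
    return {node["id"]: node for node in json.loads(text)}

def parse_segments(text):
    """Yield (ytid, start, end, [label ids]) for each labeled segment."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in rows:
        if not row or row[0].startswith("#"):
            continue  # skip the comment header lines
        ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
        # Multiple positive labels are comma-separated inside one quoted field.
        yield ytid, start, end, labels.split(",")

ontology = load_ontology(ONTOLOGY_JSON)
for ytid, start, end, labels in parse_segments(SEGMENTS_CSV):
    names = [ontology[label]["name"] for label in labels]
    print(ytid, end - start, names)
```

Resolving label ids through the ontology, rather than treating them as flat strings, is what lets downstream users exploit the class hierarchy (e.g., rolling fine-grained labels up to their parents via `child_ids`).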


FSD50K: An Open Dataset of Human-Labeled Sound Events
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio, manually labeled using 200 classes drawn from the AudioSet ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
Mt-Gcn For Multi-Label Audio Tagging With Noisy Labels
MT-GCN is presented, a Multi-task Learning based Graph Convolutional Network that learns domain knowledge from ontology that outperforms the baseline methods by a significant margin.
Audio Caption: Listen and Tell
  • Mengyue Wu, Heinrich Dinkel, Kai Yu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
A manually annotated dataset for audio captioning is introduced to automatically generate natural sentences describing audio scenes and to narrow the gap between machine perception of audio and of images.
Improved Representation Learning For Acoustic Event Classification Using Tree-Structured Ontology
It is shown that by organizing audio representations with a human-curated tree ontology, this framework can improve the quality of the learned audio representations for downstream AEC tasks and achieve comparable performance in discriminative tasks as fully supervised baselines.
Learning Sound Event Classifiers from Web Audio with Noisy Labels
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.
Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events
An Audio-Grounding dataset is contributed, which provides the correspondence between sound events and the captions provided in AudioCaps, along with the location (timestamps) of each sound event present.
Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection
This work addresses Sound Event Detection in the case where a weakly annotated dataset is available for training, and explores an approach inspired by Multiple Instance Learning, in which a convolutional recurrent neural network is trained to give predictions at frame-level using a custom loss function based on the weak labels and the statistics of the frame-based predictions.
Audio-text Retrieval in Context
This work uses pre-trained audio features and a descriptor-based aggregation method to build a contextual audio-text retrieval system and observes that semantic mapping is more important than temporal relations in contextual retrieval.
This work employs a subset of Google's AudioSet, a large collection of weakly labeled YouTube video excerpts, and applies multiple instance learning (MIL) approaches that handle weak labels by grouping instances into positive or negative bags.
Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing
  • Yu Wu, Yi Yang
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
The cross-modal audio-visual contrastive learning to induce temporal difference on attention models within videos, i.e., urging the model to pick the current temporal segment from all context candidates by a large margin is proposed.


AudioSentibank: Large-scale Semantic Ontology of Acoustic Concepts for Audio Content Analysis
This work introduces the AudioSentiBank corpus, a large-scale folksology containing over 1,123 adjective-noun and verb-noun pairs, and explores for the first time the classification performance of acoustic concept pairs.
Noisemes: Manual Annotation of Environmental Noise in Audio Streams
42 distinct labels, the "noisemes", developed for the manual annotation of noise segments as they occur in audio streams of consumer-captured and semi-professionally produced videos, are introduced.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.
TUT database for acoustic scene classification and sound event detection
The recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models are presented.
A Dataset and Taxonomy for Urban Sound Research
A taxonomy of urban sounds and a new dataset, UrbanSound, containing 27 hours of audio with 18.5 hours of annotated sound event occurrences across 10 sound classes are presented.
ImageNet Large Scale Visual Recognition Challenge
The creation of this benchmark dataset and the advances in object recognition that have been possible as a result are described, and the state-of-the-art computer vision accuracy with human accuracy is compared.
Automatic Acquisition of Hyponyms from Large Text Corpora
A set of lexico-syntactic patterns that are easily recognizable, that occur frequently and across text genre boundaries, and that indisputably indicate the lexical relation of interest are identified.
Sound Ontology for Computational Auditory Scene Analysis
This paper proposes that a sound ontology be used both as a common vocabulary for sound representation and as a common terminology for integrating various sound stream segregation systems.
CLEAR Evaluation of Acoustic Event Detection and Classification Systems
In this paper, the various systems for the tasks of AED and AEC and their results are presented.
Content-Based Classification, Search, and Retrieval of Audio
The audio analysis, search, and classification engine described here reduces sounds to perceptual and acoustical features, which lets users search or retrieve sounds by any one feature or a combination of them, by specifying previously learned classes based on these features.