Corpus ID: 221266201

CRNNs for Urban Sound Tagging with spatiotemporal context

@article{Arnault2020CRNNsFU,
  title={CRNNs for Urban Sound Tagging with spatiotemporal context},
  author={Augustin Arnault and Nicolas Riche},
  journal={ArXiv},
  year={2020},
  volume={abs/2008.10413}
}
This paper describes the CRNNs we used to participate in Task 5 of the DCASE 2020 challenge. This task focuses on hierarchical multilabel urban sound tagging with spatiotemporal context. The code is available on our GitHub repository at this https URL.
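As a rough illustration of the kind of model the title refers to, below is a minimal sketch of a CRNN for multilabel urban sound tagging with late fusion of a spatiotemporal context vector, written in PyTorch. The layer sizes, the 23-tag output (matching SONYC-UST's fine-grained taxonomy), and the fusion point are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class CRNNTagger(nn.Module):
    def __init__(self, n_mels=64, n_context=16, n_tags=23):
        super().__init__()
        # CNN front end: extracts local time-frequency features from a
        # log-mel spectrogram of shape (batch, 1, time, n_mels).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # Bidirectional GRU: models longer-range temporal structure.
        self.rnn = nn.GRU(64 * (n_mels // 4), 64, batch_first=True,
                          bidirectional=True)
        # Classifier over pooled recurrent features plus context metadata.
        self.fc = nn.Linear(2 * 64 + n_context, n_tags)

    def forward(self, spec, context):
        # spec: (batch, 1, time, n_mels); context: (batch, n_context)
        x = self.cnn(spec)                    # (batch, 64, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, time/4, 64*n_mels/4)
        x, _ = self.rnn(x)                    # (batch, time/4, 128)
        x = x.mean(dim=1)                     # temporal average pooling
        x = torch.cat([x, context], dim=1)    # late fusion of metadata
        return torch.sigmoid(self.fc(x))      # independent tag probabilities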

Citations

Convolutional Neural Networks Based System for Urban Sound Tagging with Spatiotemporal Context
This paper proposes a convolutional neural network (CNN) based system for UST with spatiotemporal context and shows that the proposed system significantly outperforms the baseline system on the evaluation metrics.
Urban Sound Classification: striving towards a fair comparison
This paper presents the DCASE 2020 Task 5 winning solution, which aims to help monitor urban noise pollution, and provides a fair comparison by using the same input representation, metrics, and optimizer to assess performance.
A Strongly-Labelled Polyphonic Dataset of Urban Sounds with Spatiotemporal Context
An accompanying hierarchical label taxonomy is introduced for SINGA:PURA, a strongly labelled polyphonic urban sound dataset with spatiotemporal context, designed to be compatible with existing urban sound tagging datasets while also capturing sound events unique to the Singaporean context.
Audio Scene Classification Using Enhanced Convolutional Neural Networks for DCASE 2021 Challenge Technical Report
This technical report describes our system proposed for Task 1B – Audio-Visual Scene Classification of the DCASE 2021 Challenge. Our system focuses on audio-signal-based classification.

References

Showing 1-10 of 22 references
SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context
The data collection procedure is described, evaluation metrics for multilabel classification of urban sound tags are proposed, and the results of a simple baseline model that exploits spatiotemporal information are reported.
SONYC Urban Sound Tagging (SONYC-UST): A Multilabel Dataset from an Urban Acoustic Sensor Network
This work describes the collection of this dataset, the metrics used to evaluate tagging systems, and the results of a simple baseline model for real-world urban noise monitoring.
Audio Set: An ontology and human-labeled dataset for audio events
This paper describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization
A convolutional neural network transformer (CNN-Transformer) is proposed for audio tagging and SED, and it is shown that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN).
Polyphonic Sound Event Detection with Weak Labeling
Sound event detection (SED) is the task of detecting the type and the onset and offset times of sound events in audio streams. It is useful for purposes such as multimedia retrieval and surveillance.
A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling
This paper builds a neural network called TALNet, which is the first system to reach state-of-the-art audio tagging performance on Audio Set while also exhibiting strong localization performance on the DCASE 2017 challenge.
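To make the pooling comparison concrete, here is a minimal sketch of two of the five multiple instance learning pooling functions the paper compares (max pooling and linear softmax pooling), mapping frame-level probabilities to a clip-level probability; the tensor shapes are assumptions.

import torch

def max_pooling(y):
    # y: frame-level probabilities of shape (batch, time, tags).
    # Clip probability is the most confident frame; gradients flow
    # to a single frame per tag.
    return y.max(dim=1).values

def linear_softmax_pooling(y):
    # Each frame weights itself by its own probability, interpolating
    # between average pooling and max pooling.
    return (y ** 2).sum(dim=1) / y.sum(dim=1)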
Time2Vec: Learning a Vector Representation of Time
This paper provides a model-agnostic vector representation of time, called Time2Vec, that can be easily imported into many existing and future architectures to improve their performance.
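For reference, here is a minimal sketch of the Time2Vec encoding: one learned linear component capturing non-periodic progression plus k sinusoidal components capturing periodic patterns. The parameter names and the choice of k are assumptions.

import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    def __init__(self, k=8):
        super().__init__()
        self.w = nn.Parameter(torch.randn(k + 1))  # frequencies omega_i
        self.b = nn.Parameter(torch.randn(k + 1))  # phase shifts phi_i

    def forward(self, tau):
        # tau: (batch, 1) scalar timestamps.
        v = tau * self.w + self.b                  # (batch, k + 1)
        # Index 0 stays linear; the rest pass through sin for periodicity.
        return torch.cat([v[:, :1], torch.sin(v[:, 1:])], dim=1)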
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
CNN architectures for large-scale audio classification
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task and that larger training and label sets help up to a point.
Group Normalization
Group Normalization can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks.
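As a quick illustration of the operation itself, here is a minimal sketch of group normalization: channels are split into groups and each group is normalized per sample, so the statistics are independent of batch size. The learnable scale and shift are omitted for brevity.

import torch

def group_norm(x, num_groups, eps=1e-5):
    # x: (N, C, H, W) with C divisible by num_groups.
    n, c, h, w = x.shape
    x = x.view(n, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(n, c, h, w)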