Polyphonic training set synthesis improves self-supervised urban sound classification.

@article{Gontier2021PolyphonicTS,
  title={Polyphonic training set synthesis improves self-supervised urban sound classification},
  author={F{\'e}lix Gontier and Vincent Lostanlen and Mathieu Lagrange and Nicolas Fortin and Catherine Lavandier and Jean-Fran{\c{c}}ois Petiot},
  journal={The Journal of the Acoustical Society of America},
  year={2021},
  volume={149},
  number={6},
  pages={4309}
}
Machine listening systems for environmental acoustic monitoring face a shortage of expert annotations to be used as training data. To circumvent this issue, the emerging paradigm of self-supervised learning proposes to pre-train audio classifiers on a task whose ground truth is trivially available. Alternatively, training set synthesis consists in annotating a small corpus of acoustic events of interest, which are then automatically mixed at random to form a larger corpus of polyphonic scenes… 
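
The random-mixing idea is simple enough to sketch. The snippet below is a minimal illustration, not the authors' pipeline: isolated, labeled event clips are pasted into a background recording at random onsets and signal-to-noise ratios, and the mixing parameters double as free ground-truth annotations. All function and variable names here are hypothetical; the Scaper entry in the references below describes a full-featured implementation of this idea.

    import numpy as np

    rng = np.random.default_rng(0)

    def synthesize_scene(background, events, sr=16000, snr_db_range=(-5, 15)):
        """Mix isolated event clips into a background at random times and SNRs.
        `events` is a list of (label, clip) pairs; clips are assumed shorter
        than the background. Returns the polyphonic mixture plus
        (onset, offset, label) annotations, i.e. free ground truth."""
        scene = background.copy()
        annotations = []
        for label, clip in events:
            onset = rng.uniform(0, (len(scene) - len(clip)) / sr)
            start = int(onset * sr)
            snr_db = rng.uniform(*snr_db_range)
            # Scale the event so its power sits snr_db above the background power.
            bg_pow = np.mean(scene ** 2) + 1e-12
            ev_pow = np.mean(clip ** 2) + 1e-12
            gain = np.sqrt(bg_pow / ev_pow * 10.0 ** (snr_db / 10.0))
            scene[start:start + len(clip)] += gain * clip
            annotations.append((onset, onset + len(clip) / sr, label))
        return scene, annotations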

Multilabel Acoustic Event Classification Using Real-World Urban Data and Physical Redundancy of Sensors

A two-stage classifier is proposed that can identify, in real time, up to 21 urban acoustic events that may occur simultaneously (i.e., multilabel classification), taking advantage of the physical redundancy of sensors in a wireless acoustic sensor network.
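
The fusion step enabled by that redundancy can be illustrated with a short sketch. This is a generic averaging rule, not the paper's exact two-stage pipeline: per-class probabilities from co-located sensors are averaged before thresholding, so that a single failing or occluded microphone does not flip the multilabel decision.

    import numpy as np

    def fuse_redundant_sensors(probs, threshold=0.5):
        """probs: (n_sensors, n_classes) sigmoid outputs from physically
        redundant sensors observing the same scene and time window.
        Averaging across sensors smooths out per-device errors."""
        fused = probs.mean(axis=0)
        return fused >= threshold  # boolean multilabel decision per class

    # e.g. three co-located sensors, 21 urban event classes
    decisions = fuse_redundant_sensors(np.random.rand(3, 21))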

Audio Self-supervised Learning: A Survey

This survey summarizes the SSL methods used for audio and speech processing applications, the empirical works that exploit the audio modality in multimodal SSL frameworks, and the existing benchmarks suitable for evaluating the power of SSL in the computer audition domain.

Multidimensional analyses of the noise impacts of COVID-19 lockdown

As part of the Agence Nationale de Recherche Caractérisation des ENvironnements SonorEs urbains (Characterization of urban sound environments) project, a questionnaire was sent in January 2019 to…

References


Audio Tagging by Cross Filtering Noisy Labels

This article presents a novel framework, named CrossFilter, to combat the noisy-label problem in audio tagging; the framework achieves state-of-the-art performance and even surpasses ensemble models on the FSDKaggle2018 dataset.
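
CrossFilter's exact procedure is not reproduced here, but the general idea of two models filtering each other's noisy labels can be sketched with a co-teaching-style small-loss criterion, which is a related but distinct heuristic; all names below are illustrative.

    import torch
    import torch.nn.functional as F

    def small_loss_filter(logits_a, logits_b, labels, keep_ratio=0.8):
        """Each network keeps the examples on which the *other* network has
        the smallest loss, on the assumption that mislabeled clips tend to
        incur large losses early in training."""
        loss_a = F.binary_cross_entropy_with_logits(
            logits_a, labels, reduction='none').mean(dim=1)
        loss_b = F.binary_cross_entropy_with_logits(
            logits_b, labels, reduction='none').mean(dim=1)
        k = int(keep_ratio * labels.shape[0])
        idx_for_a = torch.argsort(loss_b)[:k]  # b selects clean data for a
        idx_for_b = torch.argsort(loss_a)[:k]  # a selects clean data for b
        return idx_for_a, idx_for_b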

Learning Sound Event Classifiers from Web Audio with Noisy Labels

Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully labeled data, and noise-robust loss functions are shown to be effective in improving performance in the presence of corrupted labels.
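
As one example of such a noise-robust loss (chosen here for illustration; the paper compares several), the generalized cross-entropy of Zhang and Sabuncu (2018) interpolates between cross-entropy and mean absolute error, bounding the gradient contribution of mislabeled clips:

    import torch

    def generalized_cross_entropy(probs, targets, q=0.7, eps=1e-7):
        """L_q loss: (1 - p^q) / q per class, where p is the probability
        assigned to the annotated target. q -> 0 recovers cross-entropy;
        q = 1 gives MAE, which is robust to label noise."""
        p = torch.where(targets > 0.5, probs, 1.0 - probs).clamp(min=eps)
        return ((1.0 - p.pow(q)) / q).mean()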

Adaptive Pooling Operators for Weakly Labeled Sound Event Detection

This paper treats SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality. It develops a family of adaptive pooling operators, referred to as autopool, which smoothly interpolate between common pooling operators and automatically adapt to the characteristics of the sound sources in question.
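
The autopool operator is compact enough to state directly. A minimal PyTorch version follows (shapes and initialization are my assumptions, but the formula matches the summary above): one learnable scalar per class moves the pooling between mean and max.

    import torch
    import torch.nn as nn

    class AutoPool(nn.Module):
        """Softmax-weighted pooling with a learnable per-class exponent:
        alpha = 0 gives mean pooling, alpha = 1 linear-softmax pooling,
        and alpha -> infinity approaches max pooling."""

        def __init__(self, n_classes):
            super().__init__()
            self.alpha = nn.Parameter(torch.zeros(n_classes))  # start at mean

        def forward(self, x):
            # x: (batch, time, n_classes) framewise class probabilities
            weights = torch.softmax(self.alpha * x, dim=1)
            return (x * weights).sum(dim=1)  # clip-level (batch, n_classes)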

Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge

The emergence of deep learning as the most popular classification method is observed, replacing the traditional approaches based on Gaussian mixture models and support vector machines.

Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
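
Two of the audio deformations commonly used for this kind of augmentation, time stretching and pitch shifting, are available directly in librosa; the parameter ranges below are illustrative, not the paper's exact settings.

    import numpy as np
    import librosa

    rng = np.random.default_rng(0)

    def augment(y, sr):
        """Randomly stretch and pitch-shift a waveform to enlarge the
        training set without changing its label."""
        y = librosa.effects.time_stretch(y, rate=rng.uniform(0.8, 1.25))
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
        return y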

Pre-Training Audio Representations With Self-Supervision

This work proposes two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the temporal distance between two short audio segments extracted at random from the same audio clip.
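
Both pretext tasks come down to slicing a spectrogram, so their training pairs are easy to construct. The sketch below (window sizes are arbitrary choices, not the paper's) builds Audio2Vec-style reconstruction pairs and a TemporalGap-style regression target.

    import numpy as np

    def audio2vec_pairs(spec, context=6, target=2):
        """Audio2Vec: predict the missing middle slice of a spectrogram
        (n_mels, n_frames) from the frames around it."""
        window = 2 * context + target
        pairs = []
        for start in range(spec.shape[1] - window + 1):
            past = spec[:, start:start + context]
            middle = spec[:, start + context:start + context + target]
            future = spec[:, start + context + target:start + window]
            pairs.append((np.concatenate([past, future], axis=1), middle))
        return pairs

    def temporalgap_pair(spec, seg=8, rng=np.random.default_rng(0)):
        """TemporalGap: the label is the normalized time elapsed between
        two randomly placed segments of the same clip."""
        a, b = sorted(rng.integers(0, spec.shape[1] - seg, size=2))
        return spec[:, a:a + seg], spec[:, b:b + seg], (b - a) / spec.shape[1]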

Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
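
The "multiple self-supervised tasks" recipe amounts to one shared encoder with several small task heads. The PyTorch sketch below shows the shape of that design (layer sizes and task heads are placeholders, not the paper's exact configuration); after pre-training, only the encoder is kept for downstream transfer.

    import torch.nn as nn

    class MultiTaskEncoder(nn.Module):
        """Shared convolutional encoder with lightweight 'worker' heads,
        each regressing its own self-supervised target (e.g. waveform,
        spectral features, prosody)."""

        def __init__(self, dim=256, task_dims=(1, 20, 4)):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
                nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
            )
            self.workers = nn.ModuleList(nn.Conv1d(dim, d, 1) for d in task_dims)

        def forward(self, wav):                  # wav: (batch, 1, samples)
            z = self.encoder(wav)                # (batch, dim, frames)
            return [w(z) for w in self.workers]  # one prediction per task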

Scaper: A library for soundscape synthesis and augmentation

Given a collection of isolated sound events, Scaper acts as a high-level sequencer that can generate multiple soundscapes from a single, probabilistically defined "specification", increasing the variability of the output.
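
Scaper is the library behind the training set synthesis studied in the main paper; a minimal usage sketch follows. Folder paths and distribution parameters are placeholders, and the exact keyword set may vary slightly across Scaper versions.

    import scaper

    sc = scaper.Scaper(duration=10.0,
                       fg_path='foreground/',  # isolated events, one folder per label
                       bg_path='background/')
    sc.ref_db = -20

    sc.add_background(label=('choose', []),     # empty list: choose any label
                      source_file=('choose', []),
                      source_time=('const', 0))

    for _ in range(3):                          # three possibly overlapping events
        sc.add_event(label=('choose', []),
                     source_file=('choose', []),
                     source_time=('const', 0),
                     event_time=('uniform', 0, 8),
                     event_duration=('truncnorm', 2.0, 0.5, 0.5, 4.0),
                     snr=('uniform', 0, 15),
                     pitch_shift=None,
                     time_stretch=None)

    sc.generate('scene.wav', 'scene.jams')      # audio plus JAMS annotations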

Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings

This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
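
These embeddings are distributed as the openl3 Python package, so the design choices studied in the paper surface directly as keyword arguments; the values below are one published configuration, and the input file name is a placeholder.

    import openl3
    import soundfile as sf

    audio, sr = sf.read('scene.wav')
    emb, ts = openl3.get_audio_embedding(
        audio, sr,
        content_type='env',    # trained on environmental (vs. music) videos
        input_repr='mel256',   # input time-frequency representation
        embedding_size=512)    # per-frame embedding dimensionality
    # emb: (n_frames, 512) array; ts: frame timestamps in seconds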

Estimation of the Perceived Time of Presence of Sources in Urban Acoustic Environments Using Deep Learning Techniques

This paper demonstrates, on a controlled dataset, that machine learning techniques based on state-of-the-art neural architectures can predict the perceived time of presence of several sound sources with sufficient accuracy.
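
Once a model outputs framewise source-activity probabilities, the time of presence is just the fraction of frames judged active. A simple thresholding read-out is sketched below; the paper's own models relate these quantities to perceptual ratings, which this rule does not capture.

    import numpy as np

    def time_of_presence(frame_probs, threshold=0.5):
        """frame_probs: (n_frames, n_sources) activity probabilities.
        Returns, per source, the fraction of frames above threshold."""
        return (frame_probs >= threshold).mean(axis=0)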
...