A General Network Architecture for Sound Event Localization and Detection Using Transfer Learning and Recurrent Neural Network

@article{Nguyen2021AGN,
  title={A General Network Architecture for Sound Event Localization and Detection Using Transfer Learning and Recurrent Neural Network},
  author={Thi Ngoc Tho Nguyen and Ngoc Khanh Nguyen and Huy Phan and Lam Dang Pham and Kenneth Ooi and Douglas L. Jones and Woonseng Gan},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={935-939}
}
The polyphonic sound event detection and localization (SELD) task is challenging because it is difficult to jointly optimize sound event detection (SED) and direction-of-arrival (DOA) estimation in the same network. We propose a general network architecture for SELD in which the SELD network comprises sub-networks that are pre-trained to solve SED and DOA estimation independently, and a recurrent layer that combines the SED and DOA estimation outputs into SELD outputs. The recurrent layer does the…
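The combination step described in the abstract can be illustrated with a minimal sketch. It assumes the two pre-trained sub-networks already emit one output vector per time frame, and it stands in a plain Elman-style recurrent cell with random, untrained weights for the paper's recurrent layer. The names `concat_frames` and `TinyRecurrentLayer`, the layer sizes, and the example values are all hypothetical and not taken from the paper.

```python
import math
import random


def concat_frames(sed_out, doa_out):
    """Join per-frame SED and DOA outputs into one feature vector per frame."""
    if len(sed_out) != len(doa_out):
        raise ValueError("both sub-networks must emit the same number of frames")
    return [list(s) + list(d) for s, d in zip(sed_out, doa_out)]


class TinyRecurrentLayer:
    """Elman-style recurrent layer with random weights (illustrative assumption)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.n_hidden = n_hidden
        self.w_in = mat(n_hidden, n_in)       # input -> hidden
        self.w_rec = mat(n_hidden, n_hidden)  # hidden -> hidden (carries context)
        self.w_out = mat(n_out, n_hidden)     # hidden -> output

    def forward(self, frames):
        h = [0.0] * self.n_hidden
        outputs = []
        for x in frames:
            # New hidden state depends on the current frame and the previous state.
            h = [
                math.tanh(
                    sum(w * v for w, v in zip(self.w_in[i], x))
                    + sum(w * v for w, v in zip(self.w_rec[i], h))
                )
                for i in range(self.n_hidden)
            ]
            outputs.append([sum(w * v for w, v in zip(row, h)) for row in self.w_out])
        return outputs


# Hypothetical sub-network outputs for 3 frames:
# SED: activity probabilities for 2 classes; DOA: one (x, y, z) direction vector.
sed_out = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
doa_out = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]

frames = concat_frames(sed_out, doa_out)
layer = TinyRecurrentLayer(n_in=5, n_hidden=4, n_out=6)
seld_out = layer.forward(frames)
print(len(seld_out), len(seld_out[0]))
```

In a real system the sub-networks would be pre-trained and the recurrent layer trained on their outputs; here nothing is trained, so the numbers only show how the data flows from the two output streams into a single sequence.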

SSELDNET: A FULLY END-TO-END SAMPLE-LEVEL FRAMEWORK FOR SOUND EVENT LOCALIZATION AND DETECTION Technical Report

TLDR
This report investigates the possibility of applying representation learning directly to raw audio, proposes an end-to-end sample-level SELD framework, and applies three data augmentation tricks: sound field rotation, time masking, and random audio equalization.

Polyphonic audio event detection: multi-label or multi-class multi-task classification problem?

TLDR
This work proposes to frame the AED task as a multi-class classification problem by considering each possible label combination as one class, to circumvent the large number of arising classes due to combinatorial explosion.

What Makes Sound Event Localization and Detection Difficult? Insights from Error Analysis

TLDR
Experimental results indicate polyphony as the main challenge in SELD, due to the difficulty in detecting all sound events of interest, and the SELD systems tend to make fewer errors for the polyphonic scenario that is dominant in the training set.

Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

TLDR
This ACCDOA-based system, with an efficient network architecture called RD3Net and data augmentation techniques, outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection; the work also proposes impulse response simulation (IRS), which generates simulated multi-channel signals.

Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection

TLDR
An impulse response simulation framework (IRS) that augments spatial characteristics using simulated room impulse responses (RIR) and an ablation study to discuss the contribution and need for each component within the IRS.

A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection

TLDR
To investigate the individual and combined effects of ambient noise, interferers, and reverberation, the performance of the baseline on different versions of the dataset excluding or including combinations of these factors indicates that by far the most detrimental effects are caused by directional interferers.

SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection

TLDR
A novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA), with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources, is proposed.

A Review of Sound Source Localization with Deep Learning Methods

TLDR
An exhaustive topography of the neural-based localization literature in this context is provided, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy.

A Survey of Sound Source Localization with Deep Learning Methods

TLDR
An extensive topography of the neural network-based sound source localization literature is provided, organized according to the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy.

Extending GCC-PHAT using Shift Equivariant Neural Networks

TLDR
This work proposes a novel approach to extending the GCC-PHAT, where the received signals are changed using a shift equivariant neural network that preserves the timing information contained in the signals.

References

SHOWING 1-10 OF 22 REFERENCES

An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection

TLDR
The proposed EINV2 for joint SED and DoA estimation outperforms previous methods by a large margin, and has comparable performance to state-of-the-art ensemble models.

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

TLDR
The proposed convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3-D) space is generic, applicable to any array structure, and robust to unseen DOA values, reverberation, and low SNR scenarios.

Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy

TLDR
Experimental results show that the proposed two-stage polyphonic sound event detection and localization method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.

A Sequence Matching Network for Polyphonic Sound Event Localization and Detection

TLDR
A two-step approach that decouples the learning of the sound event detection and direction-of-arrival estimation systems is proposed, which allows flexibility in the system design and increases the performance of the whole sound event localization and detection system.

Ensemble of Sequence Matching Networks for Dynamic Sound Event Localization, Detection, and Tracking

TLDR
In order to estimate directions-of-arrival of moving sound sources with higher required spatial resolutions than those of static sources, this work proposes to separate the directional estimates into azimuth and elevation estimates before passing them to the sequence matching network.

On Multitask Loss Function for Audio Event Detection and Localization

TLDR
This work proposes a multitask regression model in which both (multi-label) event detection and localization are formulated as regression problems and the mean squared error loss is used homogeneously for model training.

THE USTC-IFLYTEK SYSTEM FOR SOUND EVENT LOCALIZATION AND DETECTION OF DCASE2020 CHALLENGE Technical Report

TLDR
This report proposes an entire technical solution, which consists of data augmentation, network training, model ensemble, and post-processing for DCASE 2020 challenge: Sound Event Localization and Detection (SELD).

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

TLDR
This work combines convolutional and recurrent neural networks in a convolutional recurrent neural network (CRNN), applies it to a polyphonic sound event detection task, and observes a considerable improvement for four different datasets consisting of everyday sound events.

Robust Source Counting and DOA Estimation Using Spatial Pseudo-Spectrum and Convolutional Neural Network

TLDR
This work proposes to use a 2D convolutional neural network with multi-task learning to robustly estimate the number of sources and the directions-of-arrival from short-time spatial pseudo-spectra, which have useful directional information from audio input signals.

A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

TLDR
This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge, along with a baseline that is an updated version of the one used in the previous challenge, with input features and training modifications to improve its performance.