SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection

Thi Ngoc Tho Nguyen, Karn Nichakarn Watcharasupat, Ngoc Khanh Nguyen, Douglas L. Jones, Woonseng Gan
IEEE/ACM Transactions on Audio, Speech, and Language Processing

Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial…

Data Augmentation and Squeeze-and-Excitation Network on Multiple Dimension for Sound Event Localization and Detection in Real Scenes

This work applies a Squeeze-and-Excitation block along the channel and frequency dimensions to efficiently extract feature characteristics, and proposes an original data augmentation method named Moderate Mixup to simulate situations where a noise floor or interfering events exist.

Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains

Experimental results indicate that the ability to generalize to different environments and unbalanced performance among different classes are two main challenges.

A Method Based on Dual Cross-Modal Attention and Parameter Sharing for Polyphonic Sound Event Localization and Detection

Experimental results demonstrate that the efficient model using one common decoder block based on the DCMA to predict multiple events in the track-wise output format is effective for the SELD task with up to three overlapping events.

A Track-Wise Ensemble Event Independent Network for Polyphonic Sound Event Localization and Detection

  • Jinbo Hu, Yin Cao, J. Yang
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
A track-wise ensemble event-independent network with a novel data augmentation method is proposed. It builds on the previously proposed Event-Independent Network V2, extended with conformer blocks and dense blocks, and addresses a problem that arises when ensembling models with a track-wise output format: track permutation may occur among different models.



Robust speech recognition using beamforming with adaptive microphone gains and multichannel noise reduction

Experimental results for the CHiME-3 challenge show that both the proposed MVDR beamformer with microphone gains and the MCNR postprocessing improve the speech recognition performance significantly.

Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

This ACCDOA-based system, which uses an efficient network architecture called RD3Net together with data augmentation techniques, outperformed state-of-the-art SELD systems in terms of localization and location-dependent detection. The work also proposes impulse response simulation (IRS), which generates simulated multi-channel signals.

ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection

In experimental evaluations with the DCASE 2020 Task 3 dataset, the ACCDOA representation outperformed the two-branch representation in SELD metrics with a smaller network size and performed better than state-of-the-art SELD systems in terms of localization and location-dependent detection.

An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection

The proposed EINV2 for joint SED and DoA estimation outperforms previous methods by a large margin, and has comparable performance to state-of-the-art ensemble models.

A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge, together with a baseline that is an updated version of the one used in the previous challenge, with input features and training modifications to improve its performance.

Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy

Experimental results show that the proposed two-stage polyphonic sound event detection and localization method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

The proposed convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3-D) space is generic and applicable to any array structure, and is robust to unseen DOA values, reverberation, and low SNR scenarios.

A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection

To investigate the individual and combined effects of ambient noise, interferers, and reverberation, the baseline is evaluated on different versions of the dataset that exclude or include combinations of these factors. The results indicate that by far the most detrimental effects are caused by directional interferers.

Robust DOA estimation of multiple speech sources

A combination of noise-floor tracking, onset detection, and a coherence test is used to robustly identify time-frequency bins where only one source is dominant; the directions of arrival of the sources are then estimated based on the cluster centroids.


This work implements a network with successive blocks of multiscale filters to discriminate and extract overlapping classes with different spectral characteristics, together with an output format and a permutation-invariant training loss that enable the network to detect, classify, and localize multiple instances of the same class simultaneously.