L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing

Erico Guizzo, Riccardo F. Gramaccioni, Saeid Jamili, Christian Marinoni, Edoardo Massaro, Claudia Medaglia, Giuseppe Nachira, Leonardo Nucciarelli, Ludovica Paglialunga, Marco Pennese, Sveva Pepe, Enrico Rocchi, Aurelio Uncini, Danilo Comminiello
2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)

The L3DAS21 Challenge (www.l3das.com/mlsp2021) aims to encourage and foster collaborative research on machine learning for 3D audio signal processing, with a particular focus on 3D speech enhancement (SE) and 3D sound localization and detection (SELD). Alongside the challenge, we release the L3DAS21 dataset, a 65-hour 3D audio corpus, accompanied by a Python API that facilitates data usage and the results submission stage. Usually, machine learning approaches to 3D audio tasks are…

A Survey of Sound Source Localization with Deep Learning Methods

An extensive topography of the neural network-based sound source localization literature is provided, organized according to the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy.

Lightweight Convolutional Neural Networks By Hypercomplex Parameterization

This paper defines the parameterization of hypercomplex convolutional layers to develop lightweight and efficient large-scale convolutional models and demonstrates the versatility of this approach to multiple domains of application.

Panoramic Video Salient Object Detection with Ambisonic Audio Guidance

A multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion (ACF) blocks is proposed to effectively conduct audio-visual interaction and achieves state-of-the-art results on the ASOD60K dataset.


This paper proposes and implements approaches for designing immersive audio experiences that fully exploit the capabilities of 3D audio, and shows that they can successfully separate the stems and simulate a dimensional sound effect.

L3DAS22: Exploring Loss Functions for 3D Speech Enhancement

This work explores the effects of different speech enhancement loss functions traditionally used for monophonic signals when applied to the L3DAS22 Challenge 3D Speech Enhancement Task. In addition

A Perceptual Loss Based Complex Neural Beamforming for Ambix 3D Speech Enhancement

This work proposes a novel approach to B-Format AmbiX 3D speech enhancement based on the short-time Fourier transform (STFT) representation: a Fully Complex Convolutional Network estimates a mask to be applied to the input features, achieving a score of 0.845 in the metric proposed by the challenge.

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Results of the baseline indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings.

L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment

A new dataset is generated, which maintains the same general characteristics as the L3DAS21 datasets, but with an extended number of data points and added constraints that improve the baseline model’s efficiency and overcome the major difficulties encountered by the participants of the previous challenge.

PHNNs: Lightweight Neural Networks via Parameterized Hypercomplex Convolutions

This paper introduces the parameterization of hypercomplex convolutional layers and the family of parameterized hypercomplex neural networks (PHNNs), lightweight and efficient large-scale models that outperform real and quaternion-valued counterparts.



FSD50K: An Open Dataset of Human-Labeled Sound Events

FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 h of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster SER research.

Librispeech: An ASR corpus based on public domain audio books

It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

DBNET: DOA-driven beamforming network for end-to-end farfield sound source separation

The experimental results show that the proposed extended DBnet using a convolutional-recurrent post masking network outperforms state-of-the-art source separation methods.

Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

An overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge, presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being

An Iterative Graph Spectral Subtraction Method for Speech Enhancement

A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge, together with a baseline that is an updated version of the one used in the previous challenge, with input features and training modifications to improve its performance.

Dilated U-net based approach for multichannel speech enhancement from First-Order Ambisonics recordings

This study evaluates replacing the previously investigated recurrent LSTM network with a convolutive U-net under more stressing conditions with an additional second competitive speaker, and results indicate that the use of dilated convolutive layers is beneficial in difficult situations with two interfering speakers.

A Review of Deep Learning Based Methods for Acoustic Scene Classification

This article summarizes and groups existing approaches for data preparation, i.e., feature representations, feature pre-processing, and data augmentation, and for data modeling, i.