Dilated U-net based approach for multichannel speech enhancement from First-Order Ambisonics recordings

  title={Dilated U-net based approach for multichannel speech enhancement from First-Order Ambisonics recordings},
  author={Am{\'e}lie Bosca and Alexandre Gu{\'e}rin and Laur{\'e}line Perotin and Srdjan Kiti{\'c}},
  journal={2020 28th European Signal Processing Conference (EUSIPCO)},
We present a CNN architecture for speech enhancement from multichannel first-order Ambisonics mixtures. Data-dependent spatial filters, derived from a mask-based approach, help an automatic speech recognition engine cope with adverse conditions of reverberation and competing speakers. The mask predictions are provided by a neural network fed with rough estimates of the speech and noise amplitude spectra, under the assumption of known directions of arrival. This study evaluates…
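As an illustration of the mask-based spatial filtering idea summarized above, here is a minimal numpy sketch. The function name and the parametric multichannel Wiener filter formulation are my own choices, not necessarily the paper's exact method: a predicted speech mask weights the mixture frames to estimate speech and noise spatial covariance matrices, from which a filter is derived per frequency bin.

```python
import numpy as np

def mask_based_mwf(X, speech_mask, diag_load=1e-6):
    """Derive a multichannel Wiener filter from a speech mask.

    X: (C, F, T) complex STFT of the C-channel (e.g. 4-channel FOA) mixture.
    speech_mask: (F, T) values in [0, 1], e.g. predicted by a neural network.
    Returns the filtered single-channel STFT of shape (F, T).
    """
    C, F, T = X.shape
    Y = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Xf = X[:, f, :]                       # (C, T) frames at this frequency
        ms = speech_mask[f]                   # speech mask, (T,)
        mn = 1.0 - ms                         # complementary noise mask
        # Mask-weighted spatial covariance estimates
        Rs = (ms * Xf) @ Xf.conj().T / max(ms.sum(), 1e-8)
        Rn = (mn * Xf) @ Xf.conj().T / max(mn.sum(), 1e-8)
        Rn += diag_load * np.eye(C)           # regularize the noise covariance
        # Parametric MWF with reference channel 0:
        # w = Rn^{-1} Rs e_ref / (mu + trace(Rn^{-1} Rs)), here mu = 1
        G = np.linalg.solve(Rn, Rs)
        w = G[:, 0] / (1.0 + np.trace(G).real)
        Y[f] = w.conj() @ Xf
    return Y
```

In practice the covariance estimates would be recursively smoothed over time and the reference channel chosen per application; this sketch keeps only the core mask-to-filter step.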
Towards end-to-end speech enhancement with a variational U-Net architecture
Experiments show that the residual (skip) connections in the proposed system are required for successful end-to-end signal enhancement, i.e., without filter mask estimation, and indicate a slight advantage of the variational U-Net architecture over its non-variational version in terms of signal enhancement performance under reverberant conditions.
L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing
This work proposes a novel multichannel audio configuration, based on multiple-source and multiple-perspective Ambisonic recordings performed with an array of two first-order Ambisonics microphones, for 3D speech enhancement and 3D sound localization and detection.
Multichannel Speech Separation with Recurrent Neural Networks from High-Order Ambisonics Recordings
This work derives a multichannel spatial filter from a mask estimated by a long short-term memory (LSTM) recurrent neural network, which takes one channel of the mixture together with the outputs of basic HOA beamformers as inputs, assuming the directions of arrival of the directional sources are known.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019
A fully convolutional time-domain audio separation network (Conv-TasNet) is proposed: a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
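The ideal time–frequency magnitude masking baseline referenced in the title can be sketched as follows. This is the oracle mask computation that Conv-TasNet surpasses, not the Conv-TasNet model itself, and the function names are illustrative:

```python
import numpy as np

def ideal_ratio_mask(source_mags, eps=1e-8):
    """Oracle ideal ratio masks from the true source magnitude spectrograms.

    source_mags: (S, F, T) non-negative magnitudes of the S sources.
    Returns (S, F, T) masks that sum to 1 at every time-frequency bin.
    """
    total = source_mags.sum(axis=0, keepdims=True) + eps
    return source_mags / total

def apply_mask(mixture_stft, mask):
    """Separate one source by masking the complex mixture STFT."""
    return mask * mixture_stft
```

Because the oracle mask needs the true source magnitudes, it is an upper bound on what mask-estimation systems can achieve in the time-frequency domain, which is what makes "surpassing" it with a time-domain model notable.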
CRNN-Based Multiple DoA Estimation Using Acoustic Intensity Features for Ambisonics Recordings
This work proposes to use a neural network built from stacked convolutional and recurrent layers in order to estimate the directions of arrival of multiple sources from a first-order Ambisonics recording, using features derived from the acoustic intensity vector as inputs.
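The acoustic intensity features mentioned above can be computed directly from the four FOA channel spectrograms. This is a hedged sketch under the common convention that W carries the pressure and X, Y, Z the velocity components; the per-bin normalization is my own choice:

```python
import numpy as np

def active_intensity_features(W, X, Y, Z):
    """Active acoustic intensity vector from first-order Ambisonics STFTs.

    W, X, Y, Z: (F, T) complex STFTs of the four FOA channels.
    Returns (3, F, T) real features; the source direction of arrival is
    (up to sign conventions) opposite to the intensity vector.
    """
    I = np.stack([
        np.real(np.conj(W) * X),
        np.real(np.conj(W) * Y),
        np.real(np.conj(W) * Z),
    ])
    # Normalize per bin so the network sees direction rather than energy
    norm = np.linalg.norm(I, axis=0, keepdims=True) + 1e-8
    return I / norm
```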
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
The Wave-U-Net is proposed: an adaptation of the U-Net to the one-dimensional time domain that repeatedly resamples feature maps to compute and combine features at different time scales; given the same data, its performance is comparable to that of a state-of-the-art spectrogram-based U-Net architecture.
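The repeated resampling with skip connections that characterizes the Wave-U-Net can be caricatured without any learned layers. This toy sketch only illustrates the multi-scale encoder/decoder structure; identity operations stand in for the convolutions of the real model:

```python
import numpy as np

def downsample(x):
    """Decimate by a factor of 2 (stand-in for a strided conv block)."""
    return x[::2]

def upsample(x, n):
    """Linearly interpolate back to length n (stand-in for learned upsampling)."""
    return np.interp(np.linspace(0, len(x) - 1, n), np.arange(len(x)), x)

def wave_unet_sketch(x, depth=3):
    """Multi-scale analysis/synthesis in the spirit of the Wave-U-Net:
    repeatedly downsample, then upsample and merge each scale with the
    skipped same-scale features."""
    skips = []
    for _ in range(depth):
        skips.append(x)        # save features at this scale
        x = downsample(x)
    for s in reversed(skips):
        x = upsample(x, len(s)) + s    # skip connection merges scales
    return x
```

The point of the structure is that the deepest path sees a long temporal context at low resolution while the skip connections restore the fine detail lost by decimation.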
Rank-1 constrained Multichannel Wiener Filter for speech recognition in noisy environments
An experimental study on multichannel linear filters in a specific speech recognition task, namely the CHiME-4 challenge, suggests that the speech recognition accuracy correlates more with the Mel-frequency cepstral coefficients (MFCC) feature variance than with the noise reduction or the speech distortion level.
A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation
This paper proposes to analyze a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering.
Singing Voice Separation with Deep U-Net Convolutional Networks
This work proposes a novel application of the U-Net architecture — initially developed for medical imaging — for the task of source separation, given its proven capacity for recreating the fine, low-level detail required for high-quality audio reproduction.
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
This overview reviews recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those developing environmentally robust speech recognition systems.
Distant speech separation using predicted time-frequency masks from spatial features
The results show improvements in instrumental measures of intelligibility and frequency-weighted SNR over a complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the delay-and-sum beamformer (DSB) and minimum variance distortionless response (MVDR) beamformer.
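The conventional delay-and-sum beamformer used as a baseline above admits a compact frequency-domain sketch. The far-field plane-wave assumption and the sign convention below are simplifications on my part:

```python
import numpy as np

def delay_and_sum(X, mic_positions, doa, freqs, c=343.0):
    """Frequency-domain delay-and-sum beamformer (DSB) sketch.

    X: (C, F, T) complex STFT of the C microphone channels.
    mic_positions: (C, 3) microphone coordinates in metres.
    doa: unit vector pointing from the array toward the source.
    freqs: (F,) STFT bin centre frequencies in Hz.
    """
    mic_positions = np.asarray(mic_positions, dtype=float)
    doa = np.asarray(doa, dtype=float)
    # Relative arrival-time offsets of a far-field plane wave at each mic
    delays = mic_positions @ doa / c                        # (C,) seconds
    # Phase-align the channels with a steering vector, then average them
    steer = np.exp(-2j * np.pi * np.outer(delays, freqs))   # (C, F)
    return np.mean(steer.conj()[:, :, None] * X, axis=0)    # (F, T)
```

An MVDR beamformer replaces the plain average with weights that additionally minimize the output noise power subject to a distortionless constraint in the look direction.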
Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR
It is demonstrated that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task.