One Model to Enhance Them All: Array Geometry Agnostic Multi-Channel Personalized Speech Enhancement

  title={One Model to Enhance Them All: Array Geometry Agnostic Multi-Channel Personalized Speech Enhancement},
  author={Hassan Taherian and Sefik Emre Eskimez and Takuya Yoshioka and Huaming Wang and Zhuo Chen and Xuedong Huang},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  • Published 20 October 2021
  • Computer Science
With the recent surge in video conferencing tool usage, providing high-quality speech signals and accurate captions has become essential for conducting day-to-day business or connecting with friends and family. Single-channel personalized speech enhancement (PSE) methods show promising results compared with unconditional speech enhancement (SE) methods in these scenarios due to their ability to remove interfering speech in addition to environmental noise. In this work, we leverage spatial… 


Array Configuration-Agnostic Personalized Speech Enhancement using Long-Short-Term Spatial Coherence

Personalized speech enhancement (PSE) has been a field of active research for the suppression of speech-like interferers such as competing speakers or TV dialogues. Compared with single-channel …

VarArray: Array-Geometry-Agnostic Continuous Speech Separation

VarArray is proposed, an array-geometry-agnostic speech separation neural network model that adapts elements previously proposed separately, including transform-average-concatenate, conformer-based speech separation, and inter-channel phase differences, and combines them in an efficient and cohesive way.
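As a rough illustration (not the VarArray implementation), one of the spatial cues it combines, inter-channel phase differences, can be computed directly from complex STFTs. The function name and the cos/sin encoding below are assumptions for the sketch:

```python
import numpy as np

def ipd_features(stft_ref, stft_other):
    """Inter-channel phase differences (IPDs) between a reference microphone
    and another microphone, given complex STFTs of shape (frames, freqs).
    The cos/sin encoding avoids the 2*pi phase-wrapping discontinuity."""
    phase_diff = np.angle(stft_other) - np.angle(stft_ref)
    return np.stack([np.cos(phase_diff), np.sin(phase_diff)], axis=-1)
```

Because the feature depends only on phase differences between channel pairs, it carries spatial information without assuming a particular array geometry.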

Multi-channel target speech enhancement based on ERB-scaled spatial coherence features

Recently, speech enhancement technologies based on deep learning have received considerable research attention. If the spatial information in microphone signals is exploited, microphone …
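Although the excerpt is cut off, the core feature it refers to, spatial coherence between a microphone pair, is simple to sketch. The following is a plain per-frequency coherence estimate; the ERB-scale band grouping from the paper is omitted, and the names are illustrative:

```python
import numpy as np

def spatial_coherence(x1, x2, eps=1e-12):
    """Spatial coherence between two microphones' complex STFTs of shape
    (frames, freqs): the cross-power spectrum normalized by the per-channel
    auto-powers, averaged over frames. |coherence| <= 1 by Cauchy-Schwarz."""
    cross = np.mean(x1 * x2.conj(), axis=0)
    p1 = np.mean(np.abs(x1) ** 2, axis=0)
    p2 = np.mean(np.abs(x2) ** 2, axis=0)
    return cross / np.sqrt(p1 * p2 + eps)
```

Coherent (directional) sources drive the magnitude toward 1, while diffuse noise yields lower values, which is what makes it usable as a target-speech cue.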

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

An end-to-end enhancement (E3Net) model architecture is proposed, which is 3× faster than a baseline STFT-based model, and knowledge distillation (KD) techniques are used to develop compressed student models without significantly degrading quality.

Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

This work shows that existing PSE methods suffer from a trade-off between speech over-suppression and interference leakage, addressing one problem at the expense of the other, and proposes a new PSE model training framework using cross-task knowledge distillation to mitigate this trade-off.

Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

The results show that the speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing, but the competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across this study, which calls for future research on better upstream features.

Challenges and Opportunities in Multi-device Speech Processing

We review current solutions and technical challenges for automatic speech recognition, keyword spotting, device arbitration, speech enhancement, and source localization in multi-device home …



Personalized Speech Enhancement: New Models and Comprehensive Evaluation

The results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models, and that multi-task training can alleviate the TSOS issue in addition to improving the speech recognition accuracy.

Scene-Agnostic Multi-Microphone Speech Dereverberation

This paper presents an NN architecture that can cope with microphone arrays in which the number and positions of the microphones are unknown, and demonstrates its applicability to the speech dereverberation task.

Channel-Attention Dense U-Net for Multichannel Speech Enhancement

This paper proposes Channel-Attention Dense U-Net, in which the channel-attention unit is applied recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming.

Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement

This paper aims to train a single deep neural network (DNN) that potentially performs well on unseen microphone arrays, and designs three variants of the recently proposed narrowband network to cope with a variable number of microphones.

Personalized PercepNet: Real-Time, Low-Complexity Target Voice Separation and Enhancement

A real-time speech enhancement model is presented that separates a target speaker from a noisy multi-talker mixture without compromising the low complexity of the recently proposed PercepNet.

End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation

This paper proposes transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation based on the filter-and-sum network, and shows how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays.
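The transform-average-concatenate idea can be sketched in a few lines. Below is a hedged NumPy illustration: the weight matrices `w1` and `w2` stand in for learned layers (the real model uses more elaborate transforms), and the averaging step is what makes the module independent of the number of channels:

```python
import numpy as np

def tac(features, w1, w2):
    """Transform-average-concatenate (TAC) sketch for per-channel features
    of shape (channels, dim). w1 and w2 are stand-ins for learned layers."""
    # 1) transform: apply the same map to every channel independently
    h = np.maximum(features @ w1, 0.0)
    # 2) average: pool across channels; invariant to channel count and order
    mean = np.repeat(h.mean(axis=0, keepdims=True), h.shape[0], axis=0)
    # 3) concatenate: append the pooled vector to each channel and map back
    return np.maximum(np.concatenate([h, mean], axis=1) @ w2, 0.0)
```

Because steps 1 and 3 act per channel and step 2 is order-invariant, permuting the input channels simply permutes the output rows, which is the channel permutation invariance the paper targets.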

Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement

A multi-task training framework is proposed to make monaural speech enhancement models harmless to ASR; it improves the word error rate of the SE output by 11.82% with little compromise in SE quality.

Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation

This study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation, and integrates multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation.
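For reference, the MVDR beamformer mentioned above has a closed form per frequency bin, w = R⁻¹d / (dᴴR⁻¹d), with R the noise spatial covariance and d the steering vector. A minimal NumPy sketch (variable names are my own):

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Minimum variance distortionless response (MVDR) weights for one
    frequency bin: w = R^{-1} d / (d^H R^{-1} d), where R is the (M, M)
    noise spatial covariance and d the length-M steering vector."""
    r_inv_d = np.linalg.solve(noise_cov, steering)  # R^{-1} d without explicit inverse
    return r_inv_d / (steering.conj() @ r_inv_d)
```

The denominator enforces the distortionless constraint wᴴd = 1, so the target direction passes unchanged while noise power is minimized.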

Phase-aware Speech Enhancement with Deep Complex U-Net

A novel loss function, the weighted source-to-distortion ratio (wSDR) loss, is designed to directly correlate with a quantitative evaluation measure, and the proposed model achieves state-of-the-art performance on all metrics.

All-Neural Multi-Channel Speech Enhancement

This study proposes a novel all-neural approach to multichannel speech enhancement, in which robust speaker localization, acoustic beamforming, post-filtering, and spatial filtering are all done using …