Personalized Speech Enhancement: New Models and Comprehensive Evaluation

  • Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang, Zhuo Chen, Xuedong Huang
  • Published 18 October 2021
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Personalized speech enhancement (PSE) models utilize additional cues, such as speaker embeddings like d-vectors, to remove background noise and interfering speech in real-time and thus improve the speech quality of online video conferencing systems for various acoustic scenarios. In this work, we propose two neural networks for PSE that achieve superior performance to the previously proposed VoiceFilter. In addition, we create test sets that capture a variety of scenarios that users can… 
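One common way to feed a speaker embedding into an enhancement network is to append the same d-vector to every frame of the input features. The sketch below illustrates only that conditioning step. The function name `condition_on_speaker`, the toy feature sizes, and the choice of concatenation are all assumptions made for illustration; the abstract does not say how the proposed models consume the embedding.

```python
from typing import List


def condition_on_speaker(frames: List[List[float]],
                         d_vector: List[float]) -> List[List[float]]:
    """Append the same speaker embedding to every feature frame.

    Hypothetical helper: concatenation is one common conditioning
    scheme, not necessarily the one used in the paper.
    """
    # Build new lists so the caller's frames are left unmodified.
    return [list(frame) + list(d_vector) for frame in frames]


# Toy input: 3 frames with 4 spectral features each (made-up values).
frames = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
# Toy 2-dimensional speaker embedding (real d-vectors are much larger).
d_vector = [0.25, -0.75]

conditioned = condition_on_speaker(frames, d_vector)
print(len(conditioned), len(conditioned[0]))
```

Each conditioned frame carries the spectral features followed by the speaker embedding, so a downstream network sees the target-speaker cue at every time step.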


The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement

This work shows that the quality of the NSS system's synthetic data matters: if the data is good enough, the augmented dataset can be used to improve the PSE system, which outperforms the speaker-agnostic baseline.

Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation with E3Net

A recently proposed causal end-to-end enhancement network (E3Net) is modified to obtain a joint PSE-AEC model. The joint model comes close to the expert models for each task and performs significantly better in the combined PSE and AEC scenario.

Efficient Personalized Speech Enhancement Through Self-Supervised Learning

This work presents self-supervised learning methods for monaural speaker-specific (i.e., personalized) speech enhancement models. While general-purpose models must broadly address many speakers,

One Model to Enhance Them All: Array Geometry Agnostic Multi-Channel Personalized Speech Enhancement

A new causal array-geometry-agnostic multi-channel PSE model is proposed, which can generate a high-quality enhanced signal from arbitrary microphone geometry and outperforms the model trained on a specific microphone array geometry in both speech quality and automatic speech recognition accuracy.

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

An end-to-end enhancement (E3Net) model architecture is proposed, which is 3× faster than a baseline STFT-based model, and KD techniques are used to develop compressed student models without significantly degrading quality.

Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

This work shows that existing PSE methods suffer from a trade-off between speech over-suppression and interference leakage, addressing one problem at the expense of the other, and proposes a new PSE model training framework using cross-task knowledge distillation to mitigate this trade-off.

Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

A streaming Transformer-based PSE model is presented, together with a novel cross-attention approach that gives adaptive target speaker representations. The model outperforms competitive baselines consistently, even when it is only approximately half their size.

Personalized Acoustic Echo Cancellation for Full-duplex Communications

A novel backbone neural network, termed the gated temporal convolutional neural network (GTCNN), is proposed and outperforms state-of-the-art AEC models. Speaker embeddings such as d-vectors are adopted as auxiliary information to guide the GTCNN to focus on the target speaker.

Preserving background sound in noise-robust voice conversion via multi-task learning

Experimental results demonstrate that the proposed end-to-end framework via multi-task learning outperforms the baseline systems while achieving comparable quality and speaker similarity to the VC models trained with clean data.

ICASSP 2021 Deep Noise Suppression Challenge

A DNS challenge special session was organized at INTERSPEECH 2020, for which the training and test datasets and a subjective evaluation framework were open-sourced; the framework was used to evaluate and select the final winners.



Personalized PercepNet: Real-Time, Low-Complexity Target Voice Separation and Enhancement

A real-time speech enhancement model that separates a target speaker from a noisy multi-talker mixture without compromising on complexity of the recently proposed PercepNet is presented.

Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement

A multi-task training framework is proposed that makes monaural speech enhancement models harmless to ASR and improves the word error rate of the SE output by 11.82% with little compromise in SE quality.

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

This work introduces VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system, and shows that such a model can be quantized as an 8-bit integer model and run in real time.

Phase-aware Speech Enhancement with Deep Complex U-Net

A novel loss function, the weighted source-to-distortion ratio (wSDR) loss, is proposed. It is designed to correlate directly with a quantitative evaluation measure, and the approach achieves state-of-the-art performance in all metrics.
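For orientation, here is a minimal sketch of the wSDR loss in the form it is usually written: an energy-weighted sum of two negative cosine similarities, one between the clean and estimated speech and one between the true and estimated noise. The formula is reproduced from memory of the cited paper and should be checked against it; the function names, the `eps` stabilizer, and the toy signals are assumptions for illustration.

```python
import math
from typing import Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _neg_cosine(target: Sequence[float], estimate: Sequence[float],
                eps: float = 1e-8) -> float:
    """Negative cosine similarity; -1 means a perfectly aligned estimate."""
    norm_product = math.sqrt(_dot(target, target)) * math.sqrt(_dot(estimate, estimate))
    return -_dot(target, estimate) / (norm_product + eps)


def wsdr_loss(mixture: Sequence[float], clean: Sequence[float],
              estimate: Sequence[float], eps: float = 1e-8) -> float:
    """Weighted SDR loss sketch (assumed form, not verified against the paper)."""
    # Noise is whatever remains after removing speech from the mixture.
    noise = [x - y for x, y in zip(mixture, clean)]
    noise_estimate = [x - y for x, y in zip(mixture, estimate)]
    # alpha weights the speech term by the share of energy that is speech.
    clean_energy = _dot(clean, clean)
    noise_energy = _dot(noise, noise)
    alpha = clean_energy / (clean_energy + noise_energy + eps)
    return (alpha * _neg_cosine(clean, estimate, eps)
            + (1.0 - alpha) * _neg_cosine(noise, noise_estimate, eps))


# Toy time-domain signals (made-up values).
clean = [1.0, 0.0, -1.0, 0.5]
true_noise = [0.1, -0.2, 0.05, 0.0]
mixture = [c + n for c, n in zip(clean, true_noise)]

perfect_loss = wsdr_loss(mixture, clean, clean)
inverted_loss = wsdr_loss(mixture, clean, [-c for c in clean])
print(perfect_loss, inverted_loss)
```

A perfect estimate drives both terms toward -1, while a sign-inverted estimate is penalized heavily, which is the behavior a loss bounded in [-1, 1] is meant to provide.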

Single-channel Speech Extraction Using Speaker Inventory and Attention Network

  • Xiong Xiao, Zhuo Chen, Y. Gong
  • Physics
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
A novel speech extraction method that utilizes an inventory of voice snippets of possible interfering speakers, or speaker enrollment data, in addition to that of the target speaker is proposed, and an attention-based network architecture is proposed to form time-varying masks for both the target and other speakers during the separation process.

Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification

The proposed data purification step improves the usability of the speaker-specific noisy data in the context of personalized speech enhancement and may be seen as privacy-preserving as it does not rely on any clean speech recordings or speaker embeddings.

Dense CNN With Self-Attention for Time-Domain Speech Enhancement

Experimental results demonstrate that DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.

Intrusive and Non-Intrusive Perceptual Speech Quality Assessment Using a Convolutional Neural Network

A convolutional neural network is proposed to predict the perceived quality of speech with noise, reverberation, and distortions, both intrusively and non-intrusively, i.e., with and without a clean reference signal.

Transformer-Based Acoustic Modeling for Hybrid Speech Recognition

It is demonstrated that on the widely used Librispeech benchmark, the proposed transformer-based AM outperforms the best published hybrid result by 19% to 26% relative when the standard n-gram language model (LM) is used.

Complex Ratio Masking for Monaural Speech Separation

The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
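As background for the title above, a complex ratio mask is a complex-valued gain per time-frequency bin that, when multiplied with the mixture spectrum, recovers both the magnitude and the phase of the clean spectrum. The sketch below computes the ideal mask for a few toy bins and applies it. The function names, the `eps` stabilizer, and the sample values are assumptions for illustration, and any mask compression used in the cited work is omitted.

```python
from typing import List


def complex_ratio_mask(clean: List[complex], mixture: List[complex],
                       eps: float = 1e-8) -> List[complex]:
    """Ideal complex mask M = S * conj(Y) / (|Y|^2 + eps), per bin."""
    return [s * y.conjugate() / (abs(y) ** 2 + eps)
            for s, y in zip(clean, mixture)]


def apply_complex_mask(mask: List[complex],
                       mixture: List[complex]) -> List[complex]:
    """Element-wise complex multiplication of mask and mixture spectrum."""
    return [m * y for m, y in zip(mask, mixture)]


# Toy spectra: two time-frequency bins (made-up values).
clean_spec = [1 + 1j, 0.5 - 2j]
mixture_spec = [2 + 0j, 1 + 1j]

mask = complex_ratio_mask(clean_spec, mixture_spec)
recovered = apply_complex_mask(mask, mixture_spec)
print(recovered)
```

Unlike a real-valued magnitude mask, the complex mask can rotate phase as well as scale magnitude, so applying the ideal mask returns the clean spectrum up to the small `eps` term.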