PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction

  • Yi Ma, Kong-Aik Lee, Ville Hautamäki, Haizhou Li
  • Published 3 October 2021
  • Computer Science
  • 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise. However, excessive suppression may lead to speech distortion and speaker information loss, which degrades the performance of speaker embedding extraction. To alleviate this problem, we propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction. This framework is optimized based on the feedback of the speaker identification task… 

Learning Noise Robust ResNet-Based Speaker Embedding for Speaker Recognition

Two new variants of ResNet-based speaker recognition systems are proposed that make the speaker embedding more robust against additive noise and reverberation, so that x-vectors extracted in noisy environments are close to their corresponding x-vectors in a clean environment.

MFA: TDNN with Multi-Scale Frequency-Channel Attention for Text-Independent Speaker Verification with Short Utterances

The proposed multi-scale frequency-channel attention (MFA) framework can achieve state-of-the-art performance while reducing parameters and computation complexity and the MFA mechanism is found to be effective for speaker verification with short test utterances.

Selective Listening by Synchronizing Speech With Lips

A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-talker speech mixture when given a cue that represents the target speaker, such as a pre-enrolled speech

Barlow Twins self-supervised learning for robust speaker recognition

In the proposed system, the Barlow Twins objective function is calculated in the embedding layer and it is optimized jointly with the speaker classifier loss function, integrated with the ResNet-based speaker embedding system.



Robust Speaker Recognition Using Speech Enhancement And Attention Model

The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in the authors' experiments.

VoiceID Loss: Speech Enhancement for Speaker Verification

The proposed VoiceID loss is a novel loss function for training a speech enhancement model to improve the robustness of speaker verification and consistently improves the speaker verification system on both clean and noisy conditions.

Multi-Task Learning for End-to-End Noise-Robust Bandwidth Extension

An end-to-end time-domain framework for noise-robust bandwidth extension is proposed that jointly optimizes a mask-based speech enhancement module and an ideal bandwidth extension module with multi-task learning.

Feature Enhancement with Deep Feature Losses for Speaker Verification

This work uses Deep Feature Loss which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker embedding network to propose a feature-domain supervised denoising based solution for speaker verification.

Analysis of Deep Feature Loss based Enhancement for Speaker Verification

This work analyzes various facets of the proposed enhancement network in the activation space of a pre-trained auxiliary network, and designs several dereverberation schemes, concluding that the deep feature loss enhancement scheme is ineffective for this task.

A study on data augmentation of reverberant speech for robust speech recognition

It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.

On Cross-Corpus Generalization of Deep Learning Based Speech Enhancement

The proposed techniques to address cross-corpus generalization, which include channel normalization, a better training corpus, and a smaller frame shift in the short-time Fourier transform (STFT), significantly improve objective intelligibility and quality scores on untrained corpora.

Complex Ratio Masking for Monaural Speech Separation

The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.

ICASSP 2021 Deep Noise Suppression Challenge

A DNS challenge special session at INTERSPEECH 2020 was organized, for which the training and test datasets were open-sourced along with a subjective evaluation framework that was used to evaluate and select the final winners.

Learning Complex Spectral Mapping for Speech Enhancement with Improved Cross-Corpus Generalization

This study proposes a long short-term memory (LSTM) network for complex spectral mapping and examines the importance of training corpus for cross-corpus generalization, finding that a training corpus that contains utterances with different channels can significantly improve performance on untrained corpora.