Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention

@article{Koizumi2020SpeechEU,
  title={Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention},
  author={Yuma Koizumi and Kohei Yatabe and Marc Delcroix and Yoshiki Masuyama and Daiki Takeuchi},
  journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={181-185}
}
  • Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, Daiki Takeuchi
  • Published 14 February 2020
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract the speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker-independent model. Meanwhile, in speech applications including speech recognition and synthesis, it is known that model adaptation to the target speaker improves accuracy. Our… 
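
To make the self-adaptation idea concrete, a minimal PyTorch sketch is given below: a speaker embedding is pooled from the noisy test utterance itself and concatenated with the input features of a mask estimator that uses one multi-head self-attention layer over time. The module sizes, the mean-pooling embedder, and the concatenation-style conditioning are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeakerEmbedder(nn.Module):
    """Pools frame-level features of the noisy input into one fixed speaker vector
    (assumed stand-in for the paper's auxiliary speaker-aware feature extractor)."""
    def __init__(self, feat_dim=257, emb_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())

    def forward(self, spec):                 # spec: (batch, frames, feat_dim)
        return self.proj(spec).mean(dim=1)   # (batch, emb_dim), mean over time

class SelfAdaptiveEnhancer(nn.Module):
    """Mask estimator conditioned on the utterance-level speaker embedding,
    with one multi-head self-attention layer over frames."""
    def __init__(self, feat_dim=257, emb_dim=128, hidden=256, heads=4):
        super().__init__()
        self.embedder = SpeakerEmbedder(feat_dim, emb_dim)
        self.inp = nn.Linear(feat_dim + emb_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, noisy_spec):           # (batch, frames, feat_dim)
        spk = self.embedder(noisy_spec)      # embedding from the test utterance itself
        spk = spk.unsqueeze(1).expand(-1, noisy_spec.size(1), -1)
        h = torch.relu(self.inp(torch.cat([noisy_spec, spk], dim=-1)))
        h, _ = self.attn(h, h, h)            # multi-head self-attention over time
        mask = self.out(h)                   # time-frequency mask in [0, 1]
        return mask * noisy_spec

model = SelfAdaptiveEnhancer()
enhanced = model(torch.rand(2, 100, 257))    # toy magnitude spectrogram
```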

Citations

Improving RNN Transducer with Target Speaker Extraction and Neural Uncertainty Estimation

This work presents a joint framework that combines time-domain target-speaker speech extraction with a Recurrent Neural Network Transducer, and proposes a multi-stage training strategy that pre-trains and fine-tunes each module in the system before joint training, sketched below.
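
A rough sketch of such a staged schedule, assuming a simple freeze/unfreeze arrangement; the helper names (`set_trainable`, `multi_stage_train`, `train_step`) and the example stages are hypothetical, and the cited paper's exact stages, losses, and optimizers are not reproduced.

```python
import torch

def set_trainable(module, flag):
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def multi_stage_train(extractor, asr, stages, train_step):
    """Run a pre-train / fine-tune / joint-train schedule over the two modules."""
    for stage in stages:
        set_trainable(extractor, stage["train_extractor"])
        set_trainable(asr, stage["train_asr"])
        params = [p for m in (extractor, asr) for p in m.parameters() if p.requires_grad]
        opt = torch.optim.Adam(params, lr=stage["lr"])
        for batch in stage["data"]:
            loss = train_step(extractor, asr, batch)  # extraction, ASR, or joint loss
            opt.zero_grad()
            loss.backward()
            opt.step()

# Example schedule: pre-train each module, then fine-tune jointly (loaders are placeholders).
# stages = [
#     {"train_extractor": True,  "train_asr": False, "lr": 1e-3, "data": extraction_loader},
#     {"train_extractor": False, "train_asr": True,  "lr": 1e-3, "data": asr_loader},
#     {"train_extractor": True,  "train_asr": True,  "lr": 1e-4, "data": joint_loader},
# ]
```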

Neural Noise Embedding for End-To-End Speech Enhancement with Conditional Layer Normalization

A novel enhancement architecture is introduced that integrates a deep autoencoder with a neural noise embedding, together with a new normalization method, termed conditional layer normalization (CLN), to improve the generalization of deep learning based speech enhancement for unseen environments.
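
The CLN idea, predicting the layer-norm scale and shift from a noise embedding, can be sketched as follows; the parameterization (two linear projections from an utterance-level noise embedding) is an assumption for illustration and may differ from the cited design.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Layer normalization whose scale and shift are predicted from a noise embedding."""
    def __init__(self, feat_dim, noise_emb_dim):
        super().__init__()
        self.ln = nn.LayerNorm(feat_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(noise_emb_dim, feat_dim)
        self.to_beta = nn.Linear(noise_emb_dim, feat_dim)

    def forward(self, x, noise_emb):
        # x: (batch, frames, feat_dim); noise_emb: (batch, noise_emb_dim)
        gamma = self.to_gamma(noise_emb).unsqueeze(1)   # (batch, 1, feat_dim)
        beta = self.to_beta(noise_emb).unsqueeze(1)
        return gamma * self.ln(x) + beta

cln = ConditionalLayerNorm(feat_dim=257, noise_emb_dim=64)
out = cln(torch.rand(2, 100, 257), torch.rand(2, 64))
```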

Perceptual Loss Based Speech Denoising with an Ensemble of Audio Pattern Recognition and Self-Supervised Models

A generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses, is introduced, along with a critical observation that state-of-the-art multi-task weight learning methods cannot outperform hand tuning, perhaps due to domain mismatch and weak complementarity of the losses.
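
A minimal sketch of such a perceptual ensemble loss, assuming a set of frozen pre-trained audio models and hand-tuned weights; the actual PERL networks and weighting are not reproduced here.

```python
import torch
import torch.nn.functional as F

def perceptual_ensemble_loss(enhanced, clean, frozen_models, weights):
    """Sum of feature-space distances measured by several frozen, pre-trained audio
    models (stand-ins for the paper's recognition / self-supervised networks).
    `weights` are hand-tuned scalars, echoing the observation that learned
    multi-task weights did not beat hand tuning."""
    loss = 0.0
    for model, w in zip(frozen_models, weights):
        model.eval()                          # keep the perceptual model frozen
        with torch.no_grad():
            target_feats = model(clean)       # reference features, no gradient
        loss = loss + w * F.l1_loss(model(enhanced), target_feats)
    return loss
```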

Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition

This work proposes an adversarial joint training framework with a self-attention mechanism to boost the noise robustness of the ASR system, achieving relative improvements on an artificially constructed noisy test set.

Zero-Shot Personalized Speech Enhancement Through Speaker-Informed Model Selection

  • Aswin Sivaraman, Minje Kim
  • Computer Science
    2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2021
A novel zero-shot learning approach towards personalized speech enhancement is proposed, using a sparsely active ensemble model that can outperform high-capacity generalist models with greater efficiency and improved adaptation to unseen test-time speakers.
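
One plausible reading of speaker-informed model selection is choosing the specialist whose training-speaker centroid lies closest to the test utterance's embedding; the sketch below assumes cosine similarity and one precomputed centroid per specialist, which may differ from the paper's gating scheme.

```python
import torch
import torch.nn.functional as F

def select_specialist(noisy, speaker_encoder, specialists, centroids):
    """Pick the specialist enhancement model whose training-speaker centroid is
    closest (cosine similarity) to the embedding of the incoming noisy utterance.
    specialists: list of models; centroids: (num_models, emb_dim)."""
    emb = speaker_encoder(noisy)                                     # (emb_dim,)
    sims = F.cosine_similarity(emb.unsqueeze(0), centroids, dim=1)   # (num_models,)
    return specialists[int(sims.argmax())]
```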

Audio-Visual Speech Enhancement Method Conditioned in the Lip Motion and Speaker-Discriminative Embeddings

Experimental results show that the method extracts robust speaker embeddings directly from the noisy audio without an enrollment procedure and improves enhancement performance compared with conventional AVSE methods.

Self-Attention With Restricted Time Context And Resolution In Dnn Speech Enhancement

This work shows that restricting the temporal context employed in the self-attention layers of a CNN-based network architecture is crucial for good speech enhancement performance, and proposes to combine restricted attention with a subsampled attention variant that considers long-term context at a lower temporal resolution.
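
The restricted-plus-subsampled attention pattern can be expressed as a boolean attention mask; the sketch below (the window width and subsampling factor are arbitrary choices) builds such a mask in the format expected by torch.nn.MultiheadAttention.

```python
import torch

def restricted_attention_mask(num_frames, local_width, subsample):
    """Boolean mask for self-attention: each frame may attend to neighbours within
    +/- local_width frames at full resolution, plus every `subsample`-th frame for
    coarse long-term context."""
    idx = torch.arange(num_frames)
    local = (idx[:, None] - idx[None, :]).abs() <= local_width
    coarse = (idx[None, :] % subsample) == 0
    allowed = local | coarse
    # torch.nn.MultiheadAttention treats True as "not allowed to attend"
    return ~allowed

mask = restricted_attention_mask(num_frames=100, local_width=8, subsample=16)
# Usage: attn(h, h, h, attn_mask=mask)
```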

RAT: RNN-Attention Transformer for Speech Enhancement

This paper proposes an improved Transformer model called RNN-Attention Transformer (RAT), which applies multi-head self-attention (MHSA) along the temporal dimension; experiments show that RAT significantly reduces parameters and improves performance compared to the baseline.

Interactive Speech and Noise Modeling for Speech Enhancement

This paper proposes modeling speech and noise simultaneously in a two-branch convolutional neural network, SN-Net, and designs a residual-convolution-and-attention (RA) feature extraction module to capture correlations along the temporal and frequency dimensions for both speech and noise.
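
A schematic of the two-branch speech/noise idea with a simple information exchange between branches is sketched below; the cited RA blocks and interaction modules are considerably more elaborate than these placeholder convolutions.

```python
import torch
import torch.nn as nn

class TwoBranchSN(nn.Module):
    """Two parallel branches estimating speech and noise, with a 1x1-conv
    interaction step exchanging information between them."""
    def __init__(self, ch=16):
        super().__init__()
        self.speech = nn.Conv2d(1, ch, 3, padding=1)
        self.noise = nn.Conv2d(1, ch, 3, padding=1)
        self.exchange_s = nn.Conv2d(2 * ch, ch, 1)   # noise branch informs speech branch
        self.exchange_n = nn.Conv2d(2 * ch, ch, 1)   # and vice versa
        self.out_s = nn.Conv2d(ch, 1, 3, padding=1)
        self.out_n = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, spec):                          # spec: (batch, 1, freq, time)
        s = torch.relu(self.speech(spec))
        n = torch.relu(self.noise(spec))
        s = torch.relu(self.exchange_s(torch.cat([s, n], dim=1)))
        n = torch.relu(self.exchange_n(torch.cat([s, n], dim=1)))
        return self.out_s(s), self.out_n(n)           # speech and noise estimates
```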
...

References

Showing 1-10 of 33 references

Speaker Representations for Speaker Adaptation in Multiple Speakers' BLSTM-RNN-Based Speech Synthesis

Experimental results show that the speaker representations input to the first layer of the acoustic model can effectively control speaker identity during speaker-adaptive training, thus improving the synthesized speech quality for speakers included in the training phase.

Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech

Two different approaches to speech enhancement for training TTS systems are investigated, following conventional speech enhancement methods; results show that the second approach yields larger MCEP distortion but smaller F0 errors.

DNN-Based Speech Synthesis Using Speaker Codes

The effectiveness of introducing speaker codes into DNN acoustic models for speech synthesis is investigated for two tasks, multi-speaker modeling and speaker adaptation; the proposed model outperformed conventional methods, especially when using a small number of target-speaker utterances.

Single-channel Speech Extraction Using Speaker Inventory and Attention Network

  • Xiong Xiao, Zhuo Chen, Y. Gong
  • Physics
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
A novel speech extraction method is proposed that utilizes an inventory of voice snippets from possible interfering speakers (speaker enrollment data) in addition to that of the target speaker, together with an attention-based network architecture that forms time-varying masks for both the target and the other speakers during separation.
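
The inventory-attention step, where each mixture frame attends over enrollment embeddings to produce a per-frame speaker weighting, might look roughly like this; the dimensions and scaled dot-product scoring are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def inventory_attention(mixture_feats, inventory_embs):
    """Frame-wise attention of the mixture over an inventory of speaker embeddings
    (enrollment snippets), yielding a per-frame weighting of candidate speakers
    that a downstream mask network could consume.
    mixture_feats: (frames, dim); inventory_embs: (num_speakers, dim)."""
    scores = mixture_feats @ inventory_embs.t() / mixture_feats.size(1) ** 0.5
    attn = F.softmax(scores, dim=1)                  # (frames, num_speakers)
    context = attn @ inventory_embs                  # speaker-aware context per frame
    return attn, context
```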

A study of speaker adaptation for DNN-based speech synthesis

An experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels is presented, systematically analysing the performance of each individual adaptation technique and of their combinations.

Investigations on Data Augmentation and Loss Functions for Deep Learning Based Speech-Background Separation

Data-augmented training combined with a novel loss function yields improvements in signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) compared to the best published result on the CHiME-2 medium-vocabulary data set when using a CNN+BLSTM network.

Single Channel Target Speaker Extraction and Recognition with Speaker Beam

This paper addresses the problem of single channel speech recognition of a target speaker in a mixture of speech signals. We propose to exploit auxiliary speaker information provided by an adaptation…

SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

This paper introduces SpeakerBeam, a method for extracting a target speaker from a mixture based on an adaptation utterance spoken by that speaker, and demonstrates the benefit of including speaker information in the processing and the effectiveness of the proposed method.

Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks.

In this paper we propose the utterance-level Permutation Invariant Training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep learning based solution for speaker independent…
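
The core of uPIT, searching over source-to-reference permutations at the utterance level and keeping the cheapest assignment, can be written compactly; the sketch below uses a simple MSE per source rather than the paper's phase-sensitive objective.

```python
import itertools
import torch

def upit_loss(estimates, targets):
    """Utterance-level permutation invariant training loss: evaluate every
    assignment of estimated sources to reference sources over the whole
    utterance and keep the lowest-cost permutation.
    estimates, targets: (num_sources, samples)."""
    num_src = estimates.size(0)
    best = None
    for perm in itertools.permutations(range(num_src)):
        loss = sum(torch.mean((estimates[i] - targets[p]) ** 2)
                   for i, p in enumerate(perm)) / num_src
        best = loss if best is None else torch.minimum(best, loss)
    return best

loss = upit_loss(torch.rand(2, 16000), torch.rand(2, 16000))
```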

Speech Denoising with Deep Feature Losses

An end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly is presented, which outperforms the state of the art on objective speech quality metrics and in large-scale perceptual experiments with human listeners.