Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement

Haoyu Li, Junichi Yamagishi
In recent years, speech enhancement (SE) has achieved impressive progress with the success of deep neural networks (DNNs). However, DNN approaches usually fail to generalize well to unseen environmental noise that is not included in the training data. To address this problem, we propose "noise tokens" (NTs), a set of neural noise templates that are jointly trained with the SE system. NTs dynamically capture environment variability and thus enable the DNN model to handle various…
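As a rough illustration of the token idea described above, the sketch below computes an environment embedding as an attention-weighted combination of a learned template bank. The function names, dot-product scoring, and dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def noise_embedding(frame, tokens):
    # attention over a bank of noise templates ("tokens"): score each
    # token against the current noise frame (dot product), then return
    # the softmax-weighted combination as the environment embedding
    scores = [sum(f * t for f, t in zip(frame, tok)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(dim)]
```

In a full system, the resulting embedding would condition the enhancement network on the acoustic environment; here it only demonstrates the soft template lookup.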


NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling

Noise adaptive speech enhancement with target-conditional resampling (NASTAR) is proposed, which reduces the mismatch using only one sample (one-shot) of noisy speech from the target environment.

Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

Experimental results show that the proposed encoder-decoder neural network can generate a professional high-quality speech waveform when high-quality audio is set as the reference, and improves speech enhancement performance compared with several state-of-the-art baseline systems.

Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder

This paper presents a neural speech enhancement method with a statistical feedback mechanism based on a denoising variational autoencoder (VAE), which outperforms existing mask-based and generative enhancement methods under unknown conditions.

OSSEM: one-shot speaker adaptive speech enhancement using meta learning

Experimental results first show that OSSEM can effectively adapt a pretrained SE model to a particular speaker with only one utterance, yielding improved SE results, and that it achieves performance competitive with state-of-the-art causal SE systems.

MetricGAN-U: Unsupervised Speech Enhancement/Dereverberation Based Only on Noisy/Reverberated Speech

MetricGAN-U, which stands for MetricGAN-unsupervised, is proposed to further relax the constraints of conventional unsupervised learning: only noisy speech is required to train the model, by optimizing non-intrusive speech quality metrics.

Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation

The experimental results indicate that the DNR-HiNet vocoder was able to generate a denoised and dereverberated waveform given noisy and reverberant acoustic features, and outperformed the original HiNet vocoder and several other neural vocoders.

CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

The promising results reveal that the developed CITISEN mobile application can potentially be used as a front-end processor for various speech-related services such as voice communication, assistive hearing devices, and virtual reality headsets.

InQSS: a speech intelligibility assessment model using a multi-task learning network

This study proposes InQSS, a speech intelligibility assessment model that uses both spectrogram and scattering coefficients as input features, together with a multi-task learning network that can predict not only the intelligibility scores but also the quality scores of an utterance.

Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

This paper compares multiple representations of linguistic context by conditioning a Text-to-Speech model on features of the preceding utterance, and shows that appropriate representations of either text or acoustic context alone yield significantly better naturalness than a baseline that does not use context.

Joint Noise Reduction and Listening Enhancement for Full-End Speech Enhancement

Speech enhancement (SE) methods mainly focus on recovering clean speech from noisy input. In real-world speech communication, however, noise often exists on not only the speaker side but also the listener side.



Dynamic noise aware training for speech enhancement based on deep neural networks

Three algorithms are proposed to address the mismatch problem in deep neural network (DNN) based speech enhancement; they suppress highly non-stationary noise better than competing state-of-the-art techniques.

A Regression Approach to Speech Enhancement Based on Deep Neural Networks

The proposed DNN approach suppresses highly non-stationary noise, which is tough to handle in general, and is effective in dealing with noisy speech recorded in real-world scenarios, without the annoying musical artifacts commonly observed in conventional enhancement methods.

Speaker Independence of Neural Vocoders and Their Effect on Parametric Resynthesis Speech Enhancement

  Soumi Maiti, Michael I. Mandel
  ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
This work shows that, when trained on data from enough speakers, neural vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training. It also shows that objective signal and overall quality are higher than for the state-of-the-art speech enhancement systems Wave-U-Net, Wavenet-denoise, and SEGAN.

Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR

It is demonstrated that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task.

T-GSA: Transformer with Gaussian-Weighted Self-Attention for Speech Enhancement

A Transformer with Gaussian-weighted self-attention (T-GSA), whose attention weights are attenuated according to the distance between target and context symbols, is proposed; it significantly improves speech enhancement performance compared to the standard Transformer and RNNs.
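The distance-based attenuation in T-GSA can be sketched as follows. One common realization, assumed here, adds the log of a Gaussian kernel to the attention logits before the softmax; the function name and fixed sigma are illustrative.

```python
import math

def gaussian_weighted_attention(scores, sigma=2.0):
    # scores[i][j]: raw attention score from target frame i to context
    # frame j; attenuate by squared distance |i - j| (log-Gaussian bias)
    # before normalizing with a softmax, so distant frames get less weight
    n = len(scores)
    out = []
    for i in range(n):
        logits = [scores[i][j] - (i - j) ** 2 / (2 * sigma ** 2)
                  for j in range(n)]
        m = max(logits)
        es = [math.exp(x - m) for x in logits]
        s = sum(es)
        out.append([e / s for e in es])
    return out
```

With uniform raw scores, each row of the resulting attention matrix peaks at the diagonal, reflecting the locality prior the Gaussian weighting imposes.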

Efficient Neural Audio Synthesis

The WaveRNN, a single-layer recurrent neural network with a dual softmax layer, matches the quality of the state-of-the-art WaveNet model; a new generation scheme based on subscaling folds a long sequence into a batch of shorter sequences and allows multiple samples to be generated at once.
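The subscaling fold mentioned above can be illustrated with plain index arithmetic: a sequence is split into B interleaved subsequences that can be processed as a batch. This minimal sketch shows only the reshaping, not WaveRNN's conditioning scheme; the function names are ours.

```python
def fold(x, b):
    # fold a long sequence into b interleaved shorter subsequences:
    # subsequence i holds samples i, i + b, i + 2b, ...
    return [x[i::b] for i in range(b)]

def unfold(subs):
    # inverse operation: re-interleave the subsequences into one sequence
    b = len(subs)
    out = [None] * sum(len(s) for s in subs)
    for i, s in enumerate(subs):
        for j, v in enumerate(s):
            out[i + j * b] = v
    return out
```

Folding turns one length-N generation loop into B loops of length N/B whose steps can run in parallel as a batch, which is the source of the speedup.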

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Exploring Tradeoffs in Models for Low-Latency Speech Enhancement

This work explores a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement, and finds that zero-look-ahead models can achieve, on average, within 0.03 dB SDR of the best bidirectional model.

An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech

A short-time objective intelligibility measure (STOI) is presented, which shows high correlation with the intelligibility of noisy and time-frequency weighted noisy speech (e.g., resulting from noise reduction) in three different listening experiments, and shows better correlation with speech intelligibility than five other reference objective intelligibility models.
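The core of STOI is an average of short-time correlations between clean and degraded envelope representations. The toy sketch below captures only that averaging idea on raw samples; real STOI works on one-third-octave band envelopes with roughly 400 ms segments, and the segment length and function names here are illustrative assumptions.

```python
import math

def correlation(a, b):
    # Pearson correlation between two equal-length segments
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db) if da and db else 0.0

def short_time_correlation(clean, degraded, seg=30):
    # average correlation over non-overlapping short segments,
    # mimicking STOI's short-time intermediate intelligibility scores
    scores = [correlation(clean[i:i + seg], degraded[i:i + seg])
              for i in range(0, len(clean) - seg + 1, seg)]
    return sum(scores) / len(scores)
```

An identical pair of signals scores 1.0, and additive distortion lowers the score, which is the monotone behavior an intelligibility proxy needs.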

Towards End-To-End Speech Recognition with Recurrent Neural Networks

This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function.