Handling Background Noise in Neural Speech Generation

@article{Denton2020HandlingBN,
  title={Handling Background Noise in Neural Speech Generation},
  author={Tom Denton and Alejandro Luebs and Felicia S. C. Lim and Andrew Storus and Hengchin Yeh and W. Kleijn and Jan Skoglund},
  journal={2020 54th Asilomar Conference on Signals, Systems, and Computers},
  year={2020},
  pages={667--671}
}
Recent advances in neural-network-based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reasons and discuss methods to overcome this issue. Placing a denoising preprocessing stage before feature extraction and targeting clean speech during training is shown to be the best…
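The training arrangement the abstract describes — denoise first, then extract features, while regressing against the clean reference speech — can be sketched roughly as below. This is a toy illustration, not the paper's actual system: `denoise` here is a simple spectral gate standing in for a real speech-enhancement model, and the feature extractor is a plain log-magnitude spectrum.

```python
import numpy as np

def denoise(noisy, frame=256):
    """Toy spectral-gate denoiser: attenuate frequency bins whose magnitude
    is near an estimated noise floor. A stand-in for a real enhancement model."""
    pad = (-len(noisy)) % frame
    frames = np.pad(noisy, (0, pad)).reshape(-1, frame)
    spec = np.fft.rfft(frames, axis=1)
    mag = np.abs(spec)
    # Estimate a per-bin noise floor as a low percentile over frames.
    floor = np.percentile(mag, 20, axis=0)
    gain = np.clip(1.0 - floor / (mag + 1e-8), 0.0, 1.0)
    clean = np.fft.irfft(spec * gain, n=frame, axis=1).reshape(-1)
    return clean[:len(noisy)]

def extract_features(signal, frame=256):
    """Log-magnitude spectral features of the (denoised) input."""
    pad = (-len(signal)) % frame
    frames = np.pad(signal, (0, pad)).reshape(-1, frame)
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

# Training-time arrangement: conditioning features come from the DENOISED
# input, while the generative model's target is the CLEAN reference speech.
rng = np.random.default_rng(0)
t = np.arange(4096)
clean_speech = np.sin(2 * np.pi * 0.03 * t)
noisy_speech = clean_speech + 0.3 * rng.standard_normal(t.size)
features = extract_features(denoise(noisy_speech))  # model input
target = clean_speech                               # model target
```

The key point is the asymmetry: the denoiser sits only on the feature-extraction path, so the generative model learns to produce clean speech from (imperfectly) denoised conditioning.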


Bioacoustic Event Detection with Self-Supervised Contrastive Learning

The findings of this paper establish the validity of unsupervised bioacoustic event detection using deep neural networks and self-supervised contrastive learning as an effective alternative to conventional techniques that leverage supervised methods for signal presence indication.

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Experimental results show that models trained on synthetic data in this way can produce high-quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style.

References

Showing 1-10 of 26 references

Efficient Neural Audio Synthesis

The WaveRNN, a single-layer recurrent neural network with a dual softmax layer, is shown to match the quality of the state-of-the-art WaveNet model; a new generation scheme based on subscaling folds a long sequence into a batch of shorter sequences and allows multiple samples to be generated at once.
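The fold at the heart of subscaling — turning one long sequence into a batch of shorter interleaved subsequences — amounts to a strided reshape. A minimal sketch (the batch factor `B` and toy signal are illustrative, and the conditioning scheme that lets each subsequence depend on the others is omitted):

```python
import numpy as np

def subscale_fold(x, B):
    """Fold a length-N sequence into B interleaved subsequences of length
    N // B: subsequence b holds samples b, b + B, b + 2B, ...  Generating
    the B rows as a batch lets a model emit B samples per step."""
    assert len(x) % B == 0
    return x.reshape(-1, B).T          # shape (B, N // B)

def subscale_unfold(folded):
    """Invert the fold, restoring the original sample order."""
    return folded.T.reshape(-1)

x = np.arange(16)
folded = subscale_fold(x, B=4)
# Row 0 is samples 0, 4, 8, 12; row 1 is samples 1, 5, 9, 13; and so on.
assert np.array_equal(subscale_unfold(folded), x)
```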

WaveNet: A Generative Model for Raw Audio

WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

Freesound technical demo

This demo introduces Freesound to the multimedia community and shows its potential as a research resource.

Unsupervised Sound Separation Using Mixture Invariant Training

This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
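The MixIT objective summarized above can be illustrated with a toy numpy version: given two reference mixtures and a set of estimated sources, the loss searches over binary assignments of sources to mixtures and scores the best remix. The helper names are hypothetical, and a real implementation would use SNR-based losses on the outputs of a trained separator rather than raw MSE on known sources:

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """Toy MixIT objective: each estimated source is assigned to exactly one
    of the two reference mixtures; the loss is the lowest summed MSE over
    all 2^M binary assignments (mixture invariance)."""
    M = est_sources.shape[0]
    best = np.inf
    for assign in itertools.product([0, 1], repeat=M):
        a = np.asarray(assign)
        remix1 = est_sources[a == 0].sum(axis=0)   # sources sent to mixture 1
        remix2 = est_sources[a == 1].sum(axis=0)   # sources sent to mixture 2
        loss = np.mean((remix1 - mix1) ** 2) + np.mean((remix2 - mix2) ** 2)
        best = min(best, loss)
    return best

# If the estimates are exactly the true sources, some assignment
# reconstructs both mixtures perfectly, so the loss is zero.
rng = np.random.default_rng(0)
s = rng.standard_normal((4, 100))                  # 4 true sources
mix1, mix2 = s[:2].sum(axis=0), s[2:].sum(axis=0)
print(mixit_loss(s, mix1, mix2))                   # → 0.0
```

Because only mixtures (never isolated sources) appear as targets, the objective needs no supervised source references — the property that makes the training "completely unsupervised".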

Performance Study of a Convolutional Time-Domain Audio Separation Network for Real-Time Speech Denoising

It is shown that a large part of the increase in performance between a causal and non-causal model is achieved with a lookahead of only 20 milliseconds, demonstrating the usefulness of even small lookaheads for many real-time applications.

Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder

This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
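The vector-quantization step at the heart of such a codec — snapping each encoder output to its nearest codebook entry so that only discrete indices need be transmitted to the decoder — can be sketched as below. The codebook size and latent dimension are arbitrary here, not the paper's configuration:

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbour vector quantization: map each latent vector to the
    index of the closest codebook entry (squared Euclidean distance).
    Only these indices need to be coded and transmitted."""
    # Pairwise squared distances, shape (num_latents, codebook_size).
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def dequantize(indices, codebook):
    """Decoder side: look the quantized vectors back up before synthesis."""
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 8))   # 256 entries -> 8 bits per vector
latents = rng.standard_normal((50, 8))
idx = quantize(latents, codebook)          # what gets transmitted
recon = dequantize(idx, codebook)          # what the WaveNet decoder conditions on
```

The bit rate is set by the codebook size and the latent frame rate (here, 8 bits per latent vector), which is what allows such schemes to reach very low bit rates.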

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Experimental results show that neural end-to-end TTS models trained from the LibriTTS corpus achieved above 4.0 in mean opinion scores in naturalness in five out of six evaluation speakers.

A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet

It is demonstrated that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP, and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate, opening the way for new codec designs based on neural synthesis models.

A 1200/2400 bps coding suite based on MELP

Key algorithm features of the future NATO narrowband voice coder (NBVC) are presented: a 1.2/2.4 kbps speech coder with a noise preprocessor, based on the MELP analysis algorithm, that achieves quality close to the existing 2.4 kbps federal standard.