Enhancing into the Codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders

Jonah Casebeer, Vinjai Vale, Umut Isik, Jean-Marc Valin, Ritwik Giri, Arvindh Krishnaswamy
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression for comparable speech quality. However, these models are tightly coupled with speech content and produce unintended outputs in noisy conditions. Building on VQ-VAE autoencoders with WaveRNN decoders, we develop compressor-enhancer encoders and accompanying decoders, and show that they operate well in noisy conditions. We also observe that a…
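The abstract's "discretized neural autoencoder" refers to the vector-quantization bottleneck of a VQ-VAE: each encoder output is snapped to its nearest codebook entry, and only the entry's index needs to be transmitted. A minimal numpy sketch of that step, with an arbitrary codebook size and latent dimension (illustrative assumptions, not values from the paper):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector in z (N, D) to its nearest codebook entry
    (K, D) by Euclidean distance; return the quantized vectors and the
    indices that would be entropy-coded into the bit stream."""
    # Pairwise squared distances between latents and codebook entries: (N, K)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))  # 256 entries -> 8 bits per latent frame
z = rng.normal(size=(10, 64))          # 10 encoder output frames
zq, idx = vector_quantize(z, codebook)
print(idx.shape, zq.shape)             # (10,) (10, 64)
```

The bit rate is then set by the codebook size and the latent frame rate; the paper's compressor-enhancer encoder additionally learns to map noisy input onto the codebook of clean speech, so the decoder never sees the noise.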


TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN
An end-to-end neural generative codec with a VQ-VAE based auto-encoder and a generative adversarial network (GAN), which reconstructs high-fidelity speech at a low bit rate of about 2 kb/s.
Practical cognitive speech compression
This paper presents a new neural speech compression method that is practical in the sense that it operates at a low bitrate, introduces low latency, and is compatible in computational complexity with…
Cognitive Coding of Speech
The effect of dimensionality reduction and low bitrate quantization on the extracted representations of speech attributes reaches, and for some speech attributes even exceeds, that of state-of-the-art approaches.
Revisiting Speech Content Privacy
This paper presents several scenarios that indicate a need for speech content privacy even as the techniques to achieve content privacy may necessarily vary.


Robust Low Rate Speech Coding Based on Cloned Networks and Wavenet
This work presents a new speech-coding scheme that is based on features that are robust to the distortions occurring in speech-coder input signals, and is additionally robust to noisy input.
Wavenet Based Low Rate Speech Coding
This work describes how a WaveNet generative speech model can be used to generate high-quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s, and shows that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker for a human listener.
Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder
This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
LPCNET: Improving Neural Speech Synthesis through Linear Prediction
  • J. Valin, J. Skoglund
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is demonstrated that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high-quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS, which makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones.
A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet
It is demonstrated that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at a low bitrate, opening the way for new codec designs based on neural synthesis models.
A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement
  • J. Valin
  • Computer Science
    2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP)
  • 2018
This paper demonstrates a hybrid DSP/deep learning approach to noise suppression that achieves significantly higher quality than a traditional minimum mean squared error spectral estimator, while keeping the complexity low enough for real-time operation at 48 kHz on a low-power CPU.
PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss
The novel PoCoNet architecture is a convolutional neural network that is able to more efficiently build frequency-dependent features in the early layers, and a new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality.
Speech enhancement with weighted denoising auto-encoder
A novel speech enhancement method with a Weighted Denoising Auto-encoder (WDA) is proposed, which achieves a similar amount of noise reduction in both white and colored noise while introducing less distortion to the speech signal.
Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders
WaveNet is used as the decoder to generate waveform data directly from the latent representation, and the complexity of the latent representations is reduced with two alternative disentanglement learning methods, namely instance normalization and sliced vector quantization.
Unsupervised Speech Representation Learning Using WaveNet Autoencoders
A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.