Generative Speech Coding with Predictive Variance Regularization

@article{Kleijn2021GenerativeSC,
  title={Generative Speech Coding with Predictive Variance Regularization},
  author={W. Kleijn and Andrew Storus and Michael Chinen and Tom Denton and Felicia S. C. Lim and Alejandro Luebs and Jan Skoglund and Hengchin Yeh},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={6478-6482}
}
  • Published 18 February 2021
The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We… 
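As a side note on the outlier-sensitivity argument, a toy numeric sketch may help (illustrative only: the `gaussian_nll` helper, the residual values, and the unit variance are our assumptions, not the paper's setup):

```python
import numpy as np

# Hypothetical illustration (not code from the paper): the negative
# log-likelihood of a Gaussian predictive model grows quadratically with the
# prediction error, so a single outlier can dominate the average loss.
def gaussian_nll(residuals, sigma=1.0):
    """Mean negative log-likelihood of residuals under N(0, sigma^2)."""
    return np.mean(0.5 * (residuals / sigma) ** 2
                   + np.log(sigma) + 0.5 * np.log(2 * np.pi))

clean = np.full(1000, 0.1)        # small, well-modeled residuals
noisy = clean.copy()
noisy[0] = 20.0                   # one outlier, e.g. a click in the input

# A single corrupted sample out of 1000 raises the mean loss by ~0.2,
# roughly forty times the contribution of all the clean samples combined.
loss_gap = gaussian_nll(noisy) - gaussian_nll(clean)
```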

Citations

Variable-rate discrete representation learning
TLDR
This work proposes slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and shows that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction.
Parallel Synthesis for Autoregressive Speech Generation
TLDR
Compared with the baseline autoregressive and non-autoregressive models, the proposed model achieves better MOS and shows its good generalization ability while synthesizing 44 kHz speech or utterances from unseen speakers.
A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate
TLDR
A GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s and significantly outperforms prior autoregressive vocoders for very low bit rate speech coding, with computational complexity of about 5 GMACs, providing a new state of the art in this domain.
TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN
TLDR
An end-to-end neural generative codec with a VQ-VAE based auto-encoder and a generative adversarial network (GAN), which achieves reconstructed speech with high fidelity at a low bit rate of about 2 kb/s.
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
TLDR
This work proposes a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR that leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
End-to-End Neural Speech Coding for Real-Time Communications
TLDR
The TFNet, an end-to-end neural speech codec with low latency for RTC, takes an encoder-temporal filtering-decoder paradigm that has seldom been investigated in audio coding to capture both short-term and long-term temporal dependencies.
Speech Localization at Low Bitrates in Wireless Acoustics Sensor Networks
TLDR
This work analyzes the performance of a deep neural network (DNN) based framework as a function of the audio encoding bitrate of compressed signals, employing recent communication codecs including PyAWNeS, Opus, EVS, and Lyra, and shows that for the best accuracy of the trained model it is optimal to have raw data for the second channel.
End-to-End Neural Audio Coding for Real-Time Communications
TLDR
Both subjective and objective results demonstrate the efficiency of the proposed TFNet, an end-to-end neural audio codec with low latency for RTC that takes an encoder-temporal filtering-decoder paradigm seldom investigated in audio coding.
End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding
TLDR
This paper studies an end-to-end optimization methodology that optimizes all modules in a codec jointly with respect to each other, capturing their complex interactions with a global loss function.
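The multi-stage (residual) vector quantization structure named in the title can be sketched as follows; this is a toy, untrained version in which the codebook values, stage count, and vector dimension are made up for illustration:

```python
import numpy as np

# Toy multi-stage (residual) vector quantizer: each stage quantizes the
# residual left by the previous stage, so bits accumulate across stages.
rng = np.random.default_rng(0)

def msvq_encode(x, codebooks):
    """Return one codeword index per stage for input vector x."""
    indices, residual = [], x.copy()
    for cb in codebooks:
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(i)
        residual = residual - cb[i]      # next stage codes what is left
    return indices

def msvq_decode(indices, codebooks):
    """Sum the selected codewords from every stage."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]  # 3 stages, 4 bits each
x = rng.normal(size=4)
x_hat = msvq_decode(msvq_encode(x, codebooks), codebooks)
```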
A Comparative Study of Speech Coding Techniques for Electro Larynx Speech Production
TLDR
Comparisons of selected coding methods for speech produced by an Electro Larynx (EL) device indicate that the PVWT and ACELP coders perform better than other methods, achieving about 40 dB SNR and a PESQ score of 3 for EL speech, and 75 dB SNR with a PESQ score of 3.5 for normal speech, respectively.

References

SHOWING 1-10 OF 33 REFERENCES
Robust Low Rate Speech Coding Based on Cloned Networks and Wavenet
TLDR
This work presents a new speech-coding scheme that is based on features that are robust to the distortions occurring in speech-coder input signals, and is additionally robust to noisy input.
Wavenet Based Low Rate Speech Coding
TLDR
This work describes how a WaveNet generative speech model can be used to generate high-quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s, and shows that the system additionally performs implicit bandwidth extension and that the produced speech does not significantly impair recognition of the original speaker for the human listener.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
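WaveNet models each output sample as a 256-way categorical distribution over mu-law companded amplitudes; the companding step can be sketched as follows (the constants follow the original paper's mu-law setup, but the helper names are ours):

```python
import numpy as np

# Mu-law companding (ITU-T G.711 style) as used by WaveNet to reduce
# 16-bit samples to 256 discrete levels for the softmax output.
MU = 255

def mulaw_encode(x):
    """Map x in [-1, 1] to integer levels 0..255."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)

def mulaw_decode(q):
    """Approximate inverse of mulaw_encode."""
    y = 2 * (q.astype(np.float64) / MU) - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1, 1, 101)
q = mulaw_encode(x)
x_hat = mulaw_decode(q)
# quantization is finer near zero, matching speech amplitude statistics
```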
Source Coding of Audio Signals with a Generative Model
TLDR
This work uses SampleRNN as the generative model and demonstrates that the proposed coding structure provides performance competitive with state-of-the-art source coding tools for specific categories of audio signals.
High-quality Speech Coding with Sample RNN
We provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of…
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Efficient Neural Audio Synthesis
TLDR
The WaveRNN, a single-layer recurrent neural network with a dual softmax layer, matches the quality of the state-of-the-art WaveNet model; a new generation scheme based on subscaling folds a long sequence into a batch of shorter sequences and allows multiple samples to be generated at once.
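The subscaling idea can be sketched as a simple fold/unfold of the sample sequence; this is a sketch only, since the real scheme also handles conditioning and generation order across the subsequences:

```python
import numpy as np

# Fold a long sample sequence into a batch of B shorter, interleaved
# subsequences so that several samples can be generated per step.
def fold(x, B):
    """(T,) -> (B, T // B): subsequence b holds samples b, b + B, b + 2B, ..."""
    return x[: len(x) // B * B].reshape(-1, B).T

def unfold(folded):
    """Invert fold: interleave the subsequences back into one sequence."""
    return folded.T.reshape(-1)

x = np.arange(12)
f = fold(x, 4)
# f[1] is [1, 5, 9]: every 4th sample starting at offset 1
assert np.array_equal(unfold(f), x)   # lossless round trip
```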
Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder
TLDR
This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
Performance Study of a Convolutional Time-Domain Audio Separation Network for Real-Time Speech Denoising
TLDR
It is shown that a large part of the increase in performance between a causal and non-causal model is achieved with a lookahead of only 20 milliseconds, demonstrating the usefulness of even small lookaheads for many real-time applications.
A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet
TLDR
It is demonstrated that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate, opening the way for new codec designs based on neural synthesis models.
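The linear prediction underlying LPCNet can be sketched in a few lines; this is illustrative only, since LPCNet derives its coefficients from cepstral features and models the residual with a small RNN, whereas here we simply fit coefficients by least squares:

```python
import numpy as np

# Predict each sample as a linear combination of the previous p samples;
# a neural vocoder like LPCNet then only has to model the residual.
def fit_lpc(x, p):
    """Fit a so that x[n] ≈ sum_k a[k] * x[n-1-k] for n >= p."""
    X = np.stack([x[p - 1 - k : len(x) - 1 - k] for k in range(p)], axis=1)
    a, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return a

def lpc_residual(x, a):
    """Prediction error left after applying the linear predictor."""
    p = len(a)
    X = np.stack([x[p - 1 - k : len(x) - 1 - k] for k in range(p)], axis=1)
    return x[p:] - X @ a

x = np.sin(0.3 * np.arange(400))   # a pure tone: x[n] = 2cos(0.3)x[n-1] - x[n-2]
a = fit_lpc(x, 2)
residual = lpc_residual(x, a)      # near zero: the predictor captures the tone
```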