Generative Speech Coding with Predictive Variance Regularization
@article{Kleijn2021GenerativeSC,
  title={Generative Speech Coding with Predictive Variance Regularization},
  author={W. Kleijn and Andrew Storus and Michael Chinen and Tom Denton and Felicia S. C. Lim and Alejandro Luebs and Jan Skoglund and Hengchin Yeh},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={6478-6482}
}
The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We…
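The abstract's point about the outlier sensitivity of maximum likelihood can be illustrated with a toy example. The sketch below is not the paper's model or its regularization method: it fits a single Gaussian by maximum likelihood, and the synthetic data, the 1% contamination level, the outlier value of 50, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def gaussian_ml_fit(samples):
    """Return the maximum likelihood mean and variance of a Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


def mean_nll(samples, mean, var):
    """Average negative log-likelihood of samples under N(mean, var)."""
    const = 0.5 * math.log(2.0 * math.pi * var)
    return sum(const + 0.5 * (x - mean) ** 2 / var for x in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "clean" signal: 1000 draws from a unit-variance Gaussian.
clean = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Hypothetical "real-world" signal: the same data plus 1% large outliers.
contaminated = clean + [50.0] * 10

clean_mean, clean_var = gaussian_ml_fit(clean)
cont_mean, cont_var = gaussian_ml_fit(contaminated)

print("ML variance, clean data:       ", clean_var)
print("ML variance, contaminated data:", cont_var)

# The fit driven by outliers also explains the clean samples worse.
print("NLL of clean data, clean fit:       ", mean_nll(clean, clean_mean, clean_var))
print("NLL of clean data, contaminated fit:", mean_nll(clean, cont_mean, cont_var))
```

A handful of outliers dominates the squared-error term, so the fitted variance grows by more than an order of magnitude and the model becomes a poorer description of the bulk of the data.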
15 Citations
Variable-rate discrete representation learning
- Computer Science, ArXiv
- 2021
This work proposes slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and shows that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction.
Parallel Synthesis for Autoregressive Speech Generation
- Computer Science, ArXiv
- 2022
Compared with the baseline autoregressive and non-autoregressive models, the proposed model achieves a better MOS and generalizes well when synthesizing 44 kHz speech or utterances from unseen speakers.
A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate
- Computer Science, 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2021
A GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s and significantly outperforms prior autoregressive vocoders for very low bit rate speech coding, with computational complexity of about 5 GMACs, providing a new state of the art in this domain.
TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN
- Computer Science, ICMI Companion
- 2021
An end-to-end neural generative codec with a VQ-VAE based auto-encoder and a generative adversarial network (GAN), which reconstructs speech with high fidelity at a low bit rate of about 2 kb/s.
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
- Computer Science, ArXiv
- 2022
This work proposes a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR that leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
End-to-End Neural Speech Coding for Real-Time Communications
- Computer Science
- 2022
TFNet, an end-to-end neural speech codec with low latency for RTC, adopts an encoder-temporal filtering-decoder paradigm, seldom investigated in audio coding, to capture both short-term and long-term temporal dependencies.
Speech Localization at Low Bitrates in Wireless Acoustics Sensor Networks
- Computer Science, Frontiers in Signal Processing
- 2022
This work analyzes the performance of a deep neural network (DNN) based framework as a function of the audio encoding bitrate for compressed signals, employing recent communication codecs including PyAWNeS, Opus, EVS, and Lyra, and shows that for the best accuracy of the trained model, it is optimal to have the raw data for the second channel.
End-to-End Neural Audio Coding for Real-Time Communications
- Computer Science, ArXiv
- 2022
Both subjective and objective results demonstrate the efficiency of the proposed TFNet, an end-to-end neural audio codec with low latency for RTC that takes an encoder-temporal filtering-decoder paradigm that has seldom been investigated in audio coding.
End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding
- Computer Science, Interspeech
- 2021
This paper studies an end-to-end optimization methodology that optimizes all modules in a codec jointly with respect to each other while capturing their complex interactions with a global loss function.
A Comparative Study of Speech Coding Techniques for Electro Larynx Speech Production
- Computer Science, Iraqi Journal of Information and Communication Technology
- 2022
Comparisons of selected coding methods for speech signals produced by an Electro Larynx (EL) device indicate that PVWT and ACELP coders perform better than the other methods, with about 40 dB SNR and a PESQ score of 3 for EL speech, and 75 dB with a PESQ score of 3.5 for normal speech, respectively.
References
Robust Low Rate Speech Coding Based on Cloned Networks and Wavenet
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work presents a new speech-coding scheme that is based on features that are robust to the distortions occurring in speech-coder input signals, and is additionally robust to noisy input.
Wavenet Based Low Rate Speech Coding
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This work describes how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s and shows that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener.
WaveNet: A Generative Model for Raw Audio
- Computer Science, SSW
- 2016
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Source Coding of Audio Signals with a Generative Model
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work uses SampleRNN as the generative model and demonstrates that the proposed coding structure provides performance competitive with state-of-the-art source coding tools for specific categories of audio signals.
High-quality Speech Coding with Sample RNN
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
We provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of…
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
- Computer Science, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Efficient Neural Audio Synthesis
- Computer Science, ICML
- 2018
A single-layer recurrent neural network with a dual softmax layer, the WaveRNN, that matches the quality of the state-of-the-art WaveNet model, and a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once.
Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
Performance Study of a Convolutional Time-Domain Audio Separation Network for Real-Time Speech Denoising
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
It is shown that a large part of the increase in performance between a causal and non-causal model is achieved with a lookahead of only 20 milliseconds, demonstrating the usefulness of even small lookaheads for many real-time applications.
A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet
- Computer Science, INTERSPEECH
- 2019
It is demonstrated that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate, opening the way for new codec designs based on neural synthesis models.