End-to-End Neural Speech Coding for Real-Time Communications

@inproceedings{Jiang2022EndtoEndNS,
  title={End-to-End Neural Speech Coding for Real-Time Communications},
  author={Xue Jiang and Xiulian Peng and Chengyu Zheng and Huaying Xue and Yuan Zhang and Yan Lu},
  booktitle={ICASSP},
  year={2022}
}
Deep-learning-based methods have shown advantages over traditional approaches in audio coding, but limited attention has been paid to real-time communications (RTC). This paper proposes TFNet, an end-to-end neural speech codec with low latency for RTC. It adopts an encoder-temporal-filtering-decoder paradigm that has seldom been investigated in audio coding. An interleaved structure is proposed for temporal filtering to capture both short-term and long-term temporal dependencies. Furthermore…
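The abstract's key latency idea is that every stage of the encoder-temporal-filtering-decoder chain is causal, and that the temporal-filtering stage interleaves short-range and long-range (dilated) filters. The sketch below illustrates that idea with plain causal FIR filters; all filter taps, dilations, and block counts are illustrative assumptions, not the actual TFNet architecture.

```python
# Hedged sketch of a causal encoder / temporal-filtering / decoder chain.
# The filter taps, dilations, and block count are made-up illustrative
# values, NOT the TFNet design from the paper.

def causal_fir(x, taps, dilation=1):
    """Causal FIR filter: y[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(taps):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y

def interleaved_temporal_filtering(x, n_blocks=2):
    """Alternate a short-term filter (dilation 1) with a dilated long-term
    filter, so each block mixes local detail with longer context --
    a toy analogue of the interleaved structure described in the abstract."""
    short_taps = [0.5, 0.3, 0.2]   # small receptive field
    long_taps = [0.6, 0.4]         # reaches further back via dilation
    for _ in range(n_blocks):
        x = causal_fir(x, short_taps, dilation=1)  # short-term dependencies
        x = causal_fir(x, long_taps, dilation=4)   # long-term dependencies
    return x

def encode_decode(signal):
    latent = causal_fir(signal, [1.0, -0.95])      # toy "encoder" (pre-emphasis)
    latent = interleaved_temporal_filtering(latent)
    return causal_fir(latent, [1.0, 0.95])         # toy "decoder" (de-emphasis)

out = encode_decode([0.0, 1.0, 0.0, -1.0] * 4)
print(len(out))  # same length as the input: the whole chain is causal, no lookahead
```

Because every filter is causal, each output sample is produced as soon as the corresponding input sample arrives, which is the property an RTC codec needs; a real implementation would replace these fixed FIR taps with learned causal convolutions.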
