Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding

  • Kai Zhen, Mi Suk Lee, Jongmo Sung, Seung-Wha Beack, Minje Kim
  • IEEE Signal Processing Letters
Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we present a psychoacoustic calibration scheme to re-define the loss functions of neural audio coding…
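The calibration idea sketched in the abstract can be illustrated as a perceptually weighted spectral loss: per-frequency-bin errors are scaled by how audible they are relative to a masking threshold, so errors the ear cannot hear are penalized less. This is a minimal sketch under assumed inputs; the function name and the way the threshold is supplied are illustrative, not the paper's actual formulation:

```python
import numpy as np

def weighted_spectral_loss(target_spec, decoded_spec, masking_threshold):
    """Per-bin squared spectral error, de-emphasized where the error is
    already masked (i.e., where the masking threshold is high).

    `masking_threshold` is assumed to come from a psychoacoustic model
    (e.g., a simplified MPEG-style model); here it is just an array of
    per-bin power thresholds (hypothetical input).
    """
    error_power = np.abs(target_spec - decoded_spec) ** 2
    # Weight each bin by signal power relative to the masking threshold:
    # audible (unmasked) errors dominate the loss.
    weights = np.abs(target_spec) ** 2 / (masking_threshold + 1e-12)
    return float(np.mean(weights * error_power))
```

With identical spectral errors, a high masking threshold (error inaudible) yields a much smaller loss than a low one, which is the behavior a psychoacoustically calibrated loss is meant to have.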


Scalable and Efficient Neural Speech Coding
This work presents a scalable and efficient neural waveform codec (NWC) for speech compression and employs the residual coding concept to concatenate multiple NWC autoencoding modules, where an NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created.
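The residual coding concept above can be sketched in a few lines: each module codes whatever error its predecessors left behind, and the decoded signal is the sum of all module outputs. The helper below is a hypothetical illustration in which any lossy encode-decode callable stands in for a trained NWC module:

```python
import numpy as np

def cascaded_residual_code(signal, codecs):
    """Run a chain of lossy codec modules, each coding the residual
    left by its predecessors; return the summed reconstruction.

    `codecs` is a list of encode-decode callables (assumption: real
    modules would be trained neural autoencoders).
    """
    residual = signal.copy()
    reconstruction = np.zeros_like(signal)
    for codec in codecs:
        decoded = codec(residual)      # module codes the current residual
        reconstruction += decoded      # accumulate partial reconstructions
        residual = residual - decoded  # pass the leftover error downstream
    return reconstruction
```

Appending a module can only refine the signal it receives, which is why such cascades scale quality with bitrate.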
HARP-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding
A novel autoencoder architecture that improves the architectural scalability of general-purpose neural audio coding models and implements additional skip connections in the form of additional autoencoders, each of which is a small codec that compresses the massive data transfer between the paired encoder-decoder layers.
Enhancing into the Codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders
Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression levels for comparable quality speech output. However, these…
Scalable and Efficient Neural Speech Coding: A Hybrid Design
  • Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim
  • Engineering, Computer Science
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2021
We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN)…


Perceptual coding of digital audio
This paper reviews methodologies that achieve perceptually transparent coding of FM- and CD-quality audio signals, including algorithms that manipulate transform components, subband signal decompositions, sinusoidal signal components, and linear prediction parameters, as well as hybrid algorithms that make use of more than one signal model.
On Psychoacoustically Weighted Cost Functions Towards Resource-Efficient Deep Neural Networks for Speech Denoising
The more perceptually sensible enhancement in performance achieved by simple neural network topologies shows that the proposed method can lead to resource-efficient speech denoising implementations on small devices without degrading the perceived signal fidelity.
Spatial Audio Coding without Recourse to Background Signal Compression
  • S. Zamani, K. Rose
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The proposed coding architecture overcomes the first two concerns by performing all compression in the SVD domain with a masking threshold that is calculated jointly for all encoded components, thereby accounting for cross-component masking.
A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality
Two disturbance terms, which account for distortion once auditory masking and threshold effects are factored in, amend the mean square error (MSE) loss function by introducing perceptual criteria based on human psychoacoustics.
Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder
This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding
A cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules in a two-phase training scheme, showing better objective performance than AMR-WB and the state-of-the-art DNN-based speech codec with a similar network architecture.
Vector quantization: A pattern-matching technique for speech coding
Recent results obtained in waveform coding of speech with vector quantization are reviewed, with vector quantization appearing to be a suitable coding technique that caters to this dual requirement of effective speech coding.
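The pattern-matching view of vector quantization in the entry above amounts to a nearest-codeword search: each input vector is replaced by the index of its closest codebook entry, and decoding is a table lookup. A minimal sketch, with an assumed small codebook and squared Euclidean distance as the matching criterion:

```python
import numpy as np

def vq_encode(frames, codebook):
    """Map each input vector to the index of its nearest codeword
    (pattern matching under squared Euclidean distance)."""
    # d[i, j] = ||frames[i] - codebook[j]||^2, via broadcasting
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def vq_decode(indices, codebook):
    """Reconstruction is simply a lookup of the chosen codewords."""
    return codebook[indices]
```

Only the indices need to be transmitted, which is where the bitrate saving comes from: the rate per vector is log2 of the codebook size, independent of the vector dimension.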
High-quality Speech Coding with Sample RNN
We provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of…
Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization
This work proposes a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals and demonstrates that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity.
A Perceptual Weighting Filter Loss for DNN Training In Speech Enhancement
The experimental results show that the proposed simple loss function improves the speech enhancement performance compared to a reference DNN with MSE loss in terms of perceptual quality and noise attenuation.