BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

@article{Leng2022BinauralGradAT,
  title={BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis},
  author={Yichong Leng and Zehua Chen and Junliang Guo and Haohe Liu and Jiawei Chen and Xu Tan and Danilo P. Mandic and Lei He and Xiang-Yang Li and Tao Qin and Sheng Zhao and Tie-Yan Liu},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.14807}
}
Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio in the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberation and head/ear-related filtering, which are difficult to simulate accurately with traditional digital signal processing. In this paper, we…
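The "basic physical warping" the abstract refers to can be sketched as interaural time and level differences (ITD/ILD) applied per ear. The sketch below is illustrative only and is not the paper's method; the head radius, ILD gain, and Woodworth ITD approximation are standard textbook assumptions, and the point is precisely that this warping alone omits the reverberation and ear filtering the paper targets.

```python
import numpy as np

def simple_binaural_warp(mono, sr, azimuth_deg, head_radius=0.0875, c=343.0):
    """Crude binaural rendering via interaural time/level differences.
    Ignores room reverberation and head/ear filtering (HRTFs), which is
    exactly why purely physical warping is insufficient on its own."""
    az = np.deg2rad(azimuth_deg)
    # Woodworth ITD approximation for a spherical head.
    itd = (head_radius / c) * (abs(az) + np.sin(abs(az)))
    delay = int(round(itd * sr))                      # delay in samples
    # Simple frequency-independent ILD: attenuate the far ear.
    far_gain = 10 ** (-3.0 * abs(np.sin(az)) / 20.0)
    delayed = np.concatenate([np.zeros(delay), mono])[: len(mono)]
    if azimuth_deg >= 0:   # source to the right: left ear is far/delayed
        left, right = far_gain * delayed, mono
    else:
        left, right = mono, far_gain * delayed
    return np.stack([left, right])                    # shape (2, n_samples)
```

For a source at 90° azimuth and a 16 kHz sample rate, the far (left) ear is delayed by roughly 0.66 ms (about 10 samples) and attenuated by 3 dB.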


Neural Fourier Shift for Binaural Speech Rendering

Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space, is proposed, and experimental results show that NFS outperforms previous studies on the benchmark dataset.

Binaural Rendering of Ambisonic Signals by Neural Networks

Experimental results show that neural networks outperform the conventional method in objective metrics and achieve comparable subjective metrics in an end-to-end manner.

A Context-Aware Intelligent System to Automate the Conversion of 2D Audio to 3D Audio Using Signal Processing and Machine Learning

This paper proposes and implements several approaches to design immersive audio experiences that fully exploit the abilities of 3D audio, and shows that the approach was able to successfully separate the stems and simulate a dimensional sound effect.

A Survey on Generative Diffusion Model

A diverse range of advanced techniques to speed up diffusion models is presented, covering training schedules, training-free sampling, mixed modeling, and score and diffusion unification.

Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features

A deep learning method based on a novel conditioning architecture that predicts an HRTF of the desired position from neighboring known HRTFs with their corresponding positions and anthropometric measurements, and is quantitatively and perceptually more accurate than linear interpolation.

References

SHOWING 1-10 OF 48 REFERENCES

Neural Synthesis of Binaural Speech From Mono Audio

Introduction to Head-Related Transfer Functions (HRTFs): Representations of HRTFs in Time, Frequency, and Space

In this tutorial, head-related transfer functions (HRTFs) are introduced and treated with respect to their role in the synthesis of spatial sound over headphones. HRTFs are formally defined, and…
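The role this tutorial describes — synthesizing spatial sound over headphones — amounts in the time domain to convolving a mono signal with a left/right head-related impulse response (HRIR) pair, the inverse-Fourier counterpart of the HRTF. The sketch below uses placeholder HRIRs; a real renderer would load measured responses (e.g. from the CIPIC or KEMAR datasets).

```python
import numpy as np

def render_with_hrir(mono, hrir_left, hrir_right):
    """Spatialize a mono signal for headphones by filtering it with an
    HRIR pair (time-domain HRTFs). Placeholder HRIRs; a measured pair
    encodes the direction-dependent scattering of the head and pinnae."""
    left = np.convolve(mono, hrir_left)[: len(mono)]
    right = np.convolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right])

# Trivial HRIR pair: pure unit impulses leave the signal unchanged.
identity_hrir = np.array([1.0])
```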

Natural Sound Rendering for Headphones: Integration of signal processing techniques

This tutorial article presents signal processing techniques to tackle the challenges to assist human listening in multimedia and virtual reality applications.

3-D Sound for Virtual Reality and Multimedia

Virtual auditory space: context, acoustics, and psychoacoustics; overview of spatial hearing; azimuth and elevation perception; sound source distance and environmental context; implementing 3-D sound.

Rendering localized spatial audio in a virtual auditory space

This work uses a novel way of personalizing the head related transfer functions (HRTFs) from a database, based on anatomical measurements, to create virtual auditory spaces by rendering cues that arise from anatomical scattering, environmental scattering, and dynamical effects.

Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks

This work proposes a novel framework for parameterizing and estimating impulse responses based on recent advances in neural representation learning that demonstrates robustness in estimation, even under low signal-to-noise ratios, and shows strong results when learning from spatio-temporal real-world speech data.

DiffWave: A Versatile Diffusion Model for Audio Synthesis

DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

WaveGrad: Estimating Gradients for Waveform Generation

WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality.

Denoising Diffusion Probabilistic Models

High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
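The "latent variable models" this entry describes admit a closed-form forward (noising) process: $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)I)$, which is what conditional audio diffusion models such as BinauralGrad build on. A minimal sketch, assuming the linear beta schedule from Ho et al. (2020); the schedule values are illustrative:

```python
import numpy as np

def q_sample(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form for a DDPM:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise,
    where alpha_bar_t is the cumulative product of (1 - beta)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# Linear schedule with T = 1000 steps, as in the DDPM paper.
betas = np.linspace(1e-4, 0.02, 1000)
```

At the final step the signal coefficient $\sqrt{\bar\alpha_T}$ is nearly zero, so $x_T$ is close to pure Gaussian noise — the starting point from which the reverse (denoising) chain generates samples.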

Building and Evaluation of a Real Room Impulse Response Dataset

It is shown that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results.