BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo P. Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu

Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear related filtrations, which, however, are difficult to accurately simulate in traditional digital signal processing. In this paper, we…

DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect

Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities. Binaural audio helps us to orient ourselves and establish

Neural Fourier Shift for Binaural Speech Rendering

Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space, is proposed and the experimental results show that NFS outperforms the previous studies on the benchmark dataset.

ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models

This work constructs an error-robust Adams solver (ERA-Solver) that uses the implicit Adams numerical method, consisting of a predictor and a corrector, with a Lagrange interpolation function as the predictor. The predictor is further enhanced with an error-robust strategy that adaptively selects the Lagrange bases with lower error in the estimated noise.

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents, achieves state-of-the-art TTA performance measured by both objective and subjective metrics.

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

A cascading latent diffusion approach that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions is developed, targeting real-time on a single consumer GPU.

DiffSketching: Sketch Control Image Synthesis with Diffusion Models

This paper presents a deep learning model that outperforms GAN-based methods in terms of generation quality and human evaluation, and does not rely on massive sketch-image datasets.

Binaural Rendering of Ambisonic Signals by Neural Networks

Experimental results show that neural networks outperform the conventional method in objective metrics and achieve comparable subjective metrics in an end-to-end manner.


This paper proposes and implements approaches for designing immersive audio experiences that fully exploit the capabilities of 3D audio, and shows that the approach successfully separates the stems and simulates a dimensional sound effect.

A Survey on Generative Diffusion Model

A diverse range of advanced techniques for speeding up diffusion models is presented: training schedule, training-free sampling, mixed-modeling, and score and diffusion unification.

Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features

A deep learning method based on a novel conditioning architecture that predicts an HRTF of the desired position from neighboring known HRTFs with their corresponding positions and anthropometric measurements and is quantitatively and perceptually more accurate than linear interpolation.



Neural Synthesis of Binaural Speech From Mono Audio

Introduction to Head-Related Transfer Functions (HRTFs): Representations of HRTFs in Time, Frequency, and Space

In this tutorial, head-related transfer functions (HRTFs) are introduced and treated with respect to their role in the synthesis of spatial sound over headphones. HRTFs are formally defined, and are

Natural Sound Rendering for Headphones: Integration of signal processing techniques

This tutorial article presents signal processing techniques to tackle the challenges to assist human listening in multimedia and virtual reality applications.

Rendering localized spatial audio in a virtual auditory space

This work uses a novel way of personalizing the head related transfer functions (HRTFs) from a database, based on anatomical measurements, to create virtual auditory spaces by rendering cues that arise from anatomical scattering, environmental scattering, and dynamical effects.

3-D Sound for Virtual Reality and Multimedia Cambridge

Virtual auditory space: context, acoustics and psychoacoustics; overview of spatial hearing: azimuth and elevation perception, sound source distance and environmental context; implementing 3-D sound

Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks

This work proposes a novel framework for parameterizing and estimating impulse responses based on recent advances in neural representation learning that demonstrates robustness in estimation, even under low signal-to-noise ratios, and shows strong results when learning from spatio-temporal real-world speech data.

DiffWave: A Versatile Diffusion Model for Audio Synthesis

DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

WaveGrad: Estimating Gradients for Waveform Generation

WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality.

Denoising Diffusion Probabilistic Models

High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.

Building and Evaluation of a Real Room Impulse Response Dataset

It is shown that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results.