BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis
@article{Leng2022BinauralGradAT,
  title={BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis},
  author={Yichong Leng and Zehua Chen and Junliang Guo and Haohe Liu and Jiawei Chen and Xu Tan and Danilo P. Mandic and Lei He and Xiang-Yang Li and Tao Qin and Sheng Zhao and Tie-Yan Liu},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.14807}
}
Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear-related filtering, which are difficult to simulate accurately with traditional digital signal processing. In this paper, we…
10 Citations
DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect
- Physics
- 2022
Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities. Binaural audio helps us to orient ourselves and establish…
Neural Fourier Shift for Binaural Speech Rendering
- Computer Science, ArXiv
- 2022
Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space, is proposed and the experimental results show that NFS outperforms the previous studies on the benchmark dataset.
ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models
- Computer Science
- 2023
This work constructs an error-robust Adams solver (ERA-Solver), which uses the implicit Adams numerical method, consisting of a predictor and a corrector. It leverages a Lagrange interpolation function as the predictor, further enhanced with an error-robust strategy that adaptively selects the Lagrange bases with lower error in the estimated noise.
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
- Computer Science, ArXiv
- 2023
AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents, achieves state-of-the-art TTA performance measured by both objective and subjective metrics.
Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion
- Computer Science
- 2023
A cascading latent diffusion approach that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions is developed, targeting real-time on a single consumer GPU.
DiffSketching: Sketch Control Image Synthesis with Diffusion Models
- Computer Science
- 2022
This paper presents a deep learning model that can beat GAN-based methods in terms of generation quality and human evaluation, and does not rely on massive sketch-image datasets.
Binaural Rendering of Ambisonic Signals by Neural Networks
- Computer Science, ArXiv
- 2022
Experimental results show that neural networks outperform the conventional method in objective metrics and achieve comparable subjective metrics in an end-to-end manner.
An Context-Aware Intelligent System to Automate the Conversion of 2D Audio to 3D Audio Using Signal Processing and Machine Learning
- Computer Science
- 2022
This paper proposes and implements the following approaches to design immersive audio experiences that fully exploit the abilities of 3D audio and shows that the approach was able to successfully separate the stems and simulate a dimensional sound effect.
A Survey on Generative Diffusion Model
- Computer Science, ArXiv
- 2022
A diverse range of advanced techniques to speed up the diffusion models – training schedule, training-free sampling, mixed-modeling, and score & diffusion unification are presented.
Global HRTF Interpolation via Learned Affine Transformation of Hyper-conditioned Features
- Computer Science, ArXiv
- 2022
A deep learning method based on a novel conditioning architecture that predicts an HRTF of the desired position from neighboring known HRTFs with their corresponding positions and anthropometric measurements, and is quantitatively and perceptually more accurate than linear interpolation.
References
SHOWING 1-10 OF 48 REFERENCES
Introduction to Head-Related Transfer Functions (HRTFs): Representations of HRTFs in Time, Frequency, and Space
- Physics
- 2001
In this tutorial, head-related transfer functions (HRTFs) are introduced and treated with respect to their role in the synthesis of spatial sound over headphones. HRTFs are formally defined, and are…
Natural Sound Rendering for Headphones: Integration of signal processing techniques
- Computer Science, IEEE Signal Processing Magazine
- 2015
This tutorial article presents signal processing techniques to tackle the challenges to assist human listening in multimedia and virtual reality applications.
Rendering localized spatial audio in a virtual auditory space
- Computer Science, IEEE Transactions on Multimedia
- 2004
This work uses a novel way of personalizing the head related transfer functions (HRTFs) from a database, based on anatomical measurements, to create virtual auditory spaces by rendering cues that arise from anatomical scattering, environmental scattering, and dynamical effects.
3-D Sound for Virtual Reality and Multimedia
- Physics
- 1994
Virtual auditory space: context, acoustics and psychoacoustics; overview of spatial hearing; azimuth and elevation perception; sound source distance and environmental context; implementing 3-D sound…
Deep Impulse Responses: Estimating and Parameterizing Filters with Deep Networks
- Computer Science, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
This work proposes a novel framework for parameterizing and estimating impulse responses based on recent advances in neural representation learning that demonstrates robustness in estimation, even under low signal-to-noise ratios, and shows strong results when learning from spatio-temporal real-world speech data.
DiffWave: A Versatile Diffusion Model for Audio Synthesis
- Computer Science, ICLR
- 2021
DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
WaveGrad: Estimating Gradients for Waveform Generation
- Computer Science, ICLR
- 2021
WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality.
Denoising Diffusion Probabilistic Models
- Computer Science, NeurIPS
- 2020
High quality image synthesis results are presented using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics, which naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.
Building and Evaluation of a Real Room Impulse Response Dataset
- Computer Science, IEEE Journal of Selected Topics in Signal Processing
- 2019
It is shown that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results.