Visually Informed Binaural Audio Generation without Binaural Audios

@article{Xu2021VisuallyIB,
  title={Visually Informed Binaural Audio Generation without Binaural Audios},
  author={Xudong Xu and Hang Zhou and Ziwei Liu and Bo Dai and Xiaogang Wang and Dahua Lin},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={15480-15489}
}
  • Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, Dahua Lin
  • Published 13 April 2021
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Stereophonic audio, especially binaural audio, plays an essential role in immersive viewing environments. Recent research has explored generating visually guided stereophonic audios supervised by multi-channel audio collections. However, due to the requirement of professional recording devices, existing datasets are limited in scale and variety, which impedes the generalization of supervised methods in real-world scenarios. In this work, we propose PseudoBinaural, an effective pipeline that is… 
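
To make the task concrete, the sketch below is a deliberately crude, hypothetical mono-to-stereo spatializer driven only by interaural time and level differences (ITD/ILD). It is not the PseudoBinaural pipeline; the function name, head-radius constant, and gain rule are illustrative assumptions, and real binaural rendering relies on head-related transfer functions or spherical-harmonic (ambisonic) representations rather than these two cues alone.

# Hypothetical illustration only, NOT the paper's method: spatialize a mono
# waveform into two channels using a Woodworth-style interaural time difference
# and a simple interaural level difference. Valid roughly for |azimuth| <= 90 deg.
import numpy as np

def naive_binauralize(mono, azimuth_deg, sr=16000, head_radius_m=0.0875):
    """Return a (2, N) stereo array for a source at the given azimuth
    (0 deg = straight ahead, +90 deg = listener's right). Hypothetical helper."""
    az = np.deg2rad(azimuth_deg)
    speed_of_sound = 343.0
    # Woodworth ITD approximation from head geometry (seconds).
    itd = head_radius_m * (abs(az) + np.sin(abs(az))) / speed_of_sound
    delay_samples = int(round(itd * sr))
    # Simple ILD: attenuate the far ear by up to ~6 dB.
    gain_far = 10 ** (-6.0 * abs(np.sin(az)) / 20.0)
    # Delay the far-ear signal by the ITD and trim to the original length.
    delayed = np.concatenate([np.zeros(delay_samples), mono])[: len(mono)]
    if azimuth_deg >= 0:   # source on the right: left ear is farther and delayed
        left, right = gain_far * delayed, mono
    else:                  # source on the left: right ear is farther and delayed
        left, right = mono, gain_far * delayed
    return np.stack([left, right])

# Usage: one second of noise spatialized 45 degrees to the listener's right.
stereo = naive_binauralize(np.random.randn(16000).astype(np.float32), azimuth_deg=45.0)
print(stereo.shape)  # (2, 16000)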

Citations

Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention
TLDR
It is argued that the depth map of the scene can act as a proxy for inducing distance information of different objects in the scene for the task of audio binauralization, and a novel encoder-decoder architecture with a hierarchical attention mechanism is proposed to encode image, depth, and audio features jointly.
Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video
TLDR
This work develops a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream's coherence with the positions of the sound source(s), and the geometric consistency of the sounding objects over time.
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis
TLDR
The proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples and outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds
TLDR
This work develops an approach for scene understanding purely based on binaural sounds that employs a cross-modal distillation framework that consists of multiple vision teacher methods and a sound student method that is trained to generate the same results as the teacher methods do.
Binaural audio generation via multi-task learning
TLDR
A learning-based approach is presented for generating binaural audio from mono audio using multi-task learning, optimizing the overall loss as a weighted sum of the losses of the two tasks.
Visual Sound Localization in the Wild by Cross-Modal Interference Erasing
TLDR
Quantitative and qualitative evaluations demonstrate that the Interference Eraser framework achieves superior results on sound localization tasks, especially in real-world scenarios.
Unsupervised Sound Localization via Iterative Contrastive Learning
TLDR
An iterative contrastive learning framework is proposed that requires no data annotations for sound localization, gradually encourages the localization of the sounding objects, and reduces the correlation between the non-sounding regions and the reference audio.
Few-Shot Audio-Visual Learning of Environment Acoustics
TLDR
A transformer-based method is introduced that uses self-attention to build a rich acoustic context and then predicts RIRs for arbitrary query source-receiver locations through cross-attention; it is demonstrated that this method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner.
Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation
TLDR
This paper proposes a cyclic co-learning (CCoL) paradigm that can jointly learn sounding object visual grounding and audio-visual sound separation in a unified framework and improves training example sampling for sounding object grounding, which builds a co-learning cycle for the two tasks and makes them mutually beneficial.
Sound Localization by Self-Supervised Time Delay Estimation
TLDR
This work adapts the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on “in the wild” internet recordings.
...
...

References

SHOWING 1-10 OF 52 REFERENCES
Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation
TLDR
This work integrates both stereo generation and source separation into a unified framework, Sep-Stereo, by considering source separation as a particular type of audio spatialization, and proposes a novel associative pyramid network architecture carefully designed for audio-visual feature fusion.
Vision-Infused Deep Audio Inpainting
TLDR
This work considers a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that are coherent with their accompanying videos, and shows the effectiveness of the proposed Vision-Infused Audio Inpainter (VIAI).
Co-Separating Sounds of Visual Objects
  • Ruohan Gao, K. Grauman
  • Computer Science
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
TLDR
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
Self-Supervised Audio Spatialization with Correspondence Classifier
TLDR
This work proposes a self-supervised audio spatialization network that can generate spatial audio given the corresponding video and monaural audio, and uses an auxiliary classifier to distinguish ground-truth videos from those whose left and right audio channels have been swapped.
Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds
TLDR
This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds, using a cross-modal distillation framework that consists of a vision 'teacher' method and a sound 'student' method that is trained to generate the same results as the teacher method.
Self-Supervised Generation of Spatial Audio for 360 Video
TLDR
This work introduces an approach to convert mono audio recorded by a 360° video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere, and shows that it is possible to infer the spatial localization of sounds based only on a synchronized 360° video and the mono audio track.
Scene-aware audio for 360° videos
TLDR
This work proposes a method that synthesizes the directional impulse response between any source and listening locations by combining a synthesized early reverberation part and a measured late reverberation tail, and demonstrates the strength of the method in several applications.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
Learning to Separate Object Sounds by Watching Unlabeled Video
TLDR
This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.
Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
TLDR
This work proposes a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream, and demonstrates that understanding spatial correspondence enables models to perform better on three audio-visual tasks.
...
...