Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

@inproceedings{garg2021geometry,
  title={Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video},
  author={Rishabh Garg and Ruohan Gao and Kristen Grauman},
  booktitle={British Machine Vision Conference},
  year={2021}
}
Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process… 


PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection

This work proposes a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract most human attention in 360° panoramic videos of real-life daily scenes, and a new baseline network that exploits both the visual and audio cues of 360° video frames using a conditional variational auto-encoder (CVAE).

Sound Localization by Self-Supervised Time Delay Estimation

This work adapts the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on “in the wild” internet recordings.
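The time-delay estimation task above can be illustrated with a classical, non-learned baseline: pick the lag that maximizes the cross-correlation between the two channels. This is a generic sketch of the task, not the paper's self-supervised method; the signal, sample rate, and delay here are synthetic.

```python
import numpy as np

def estimate_time_delay(left, right, sr):
    """Estimate how much the right channel lags the left (in seconds)
    by peak-picking the full cross-correlation."""
    corr = np.correlate(right, left, mode="full")
    # Center the lag index so 0 corresponds to no delay.
    lag = np.argmax(corr) - (len(left) - 1)
    return lag / sr

# Toy example: the right channel is the left delayed by 5 samples.
sr = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(1000)
right = np.roll(left, 5)
delay = estimate_time_delay(left, right, sr)
```

For broadband signals the correlation peak is sharp; learned methods such as the one summarized above aim to keep this robust on reverberant "in the wild" recordings where the naive peak can be ambiguous.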

Room Acoustic Properties Estimation from a Single 360° Photo

Estimating room impulse responses (RIRs) in real spaces is a time-consuming and expensive process requiring multiple pieces of equipment, recordings, and processing. A simple computer-vision-based

See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation

A robot system that sees with a camera, hears with a contact microphone, and feels with a vision-based tactile sensor, fusing all three sensory modalities with a self-attention model, outperforms prior methods.

2.5D Visual Sound

  • Ruohan Gao, K. Grauman
  • Physics, Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
A deep convolutional neural network is devised that learns to decode the monaural soundtrack into its binaural counterpart by injecting visual information about object and scene configurations, and the resulting output 2.5D visual sound helps "lift" the flat single channel audio into spatialized sound.
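The mono-to-binaural decomposition used in this line of work can be sketched directly: with a mono mixture mono = L + R and a model-predicted difference signal diff = L - R, the two channels are recovered by half-sum and half-difference. A synthetic stereo pair stands in for the network prediction here; the sketch shows only the reconstruction step, not the CNN that predicts the difference.

```python
import numpy as np

def binauralize(mono, diff):
    """Recover left/right channels from a mono mixture and a
    (predicted) left-minus-right difference signal."""
    left = (mono + diff) / 2.0
    right = (mono - diff) / 2.0
    return left, right

# Round-trip check: decompose a stereo pair, then reconstruct it.
rng = np.random.default_rng(0)
L = rng.standard_normal(8)
R = rng.standard_normal(8)
left, right = binauralize(L + R, L - R)
```

Predicting only the difference signal lets the model reuse the mono content verbatim and focus its capacity on the spatial cues.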

Visually Informed Binaural Audio Generation without Binaural Audios

This work proposes PseudoBinaural, an effective pipeline that is free of binaural recordings, and shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.

Learning to Separate Object Sounds by Watching Unlabeled Video

This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.

Co-Separating Sounds of Visual Objects

  • Ruohan Gao, K. Grauman
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.

Learning Representations from Audio-Visual Spatial Alignment

A novel self-supervised pretext task for learning representations from audio-visual content: a transformer architecture combines representations from multiple viewpoints, and the ability to perform spatial alignment is enhanced by reasoning over the full spatial content of the 360° video.

Vision-Infused Deep Audio Inpainting

This work considers a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that are coherent with their accompanying video, and shows the effectiveness of the proposed Vision-Infused Audio Inpainter (VIAI).

Self-Supervised Generation of Spatial Audio for 360 Video

This work introduces an approach to convert mono audio recorded by a 360° video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere, and shows that it is possible to infer the spatial localization of sounds based only on a synchronized 360° video and the mono audio track.
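One standard representation of "sound over the full viewing sphere" is first-order ambisonics (B-format). As a hedged illustration of that target format (a textbook FuMa panner for a single point source, not the paper's learned model), a mono signal at a known azimuth/elevation can be encoded as:

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics (FuMa B-format)
    for a point source at the given azimuth/elevation (radians)."""
    w = mono / np.sqrt(2.0)                         # omnidirectional
    x = mono * np.cos(azimuth) * np.cos(elevation)  # front-back
    y = mono * np.sin(azimuth) * np.cos(elevation)  # left-right
    z = mono * np.sin(elevation)                    # up-down
    return np.stack([w, x, y, z])

sig = np.ones(4)
bfmt = encode_foa(sig, azimuth=np.pi / 2, elevation=0.0)  # source hard left
```

The learning problem summarized above is essentially the inverse: given mono audio and the 360° frames, predict the directional channels that this panner would produce for each source.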

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound

This work proposes a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream, and demonstrates that understanding spatial correspondence enables models to perform better on three audio-visual tasks.

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels, and the proposed framework can effectively leverage unimodal and cross-modal temporal contexts and alleviate modality-bias and noisy-label problems.

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.