• Corpus ID: 243938680

Structure from Silence: Learning Scene Structure from Ambient Sound

  • Ziyang Chen, Xixi Hu, Andrew Owens
From whirling ceiling fans to ticking clocks, the sounds that we hear subtly vary as we move through a scene. We ask whether these ambient sounds convey information about 3D scene structure and, if so, whether they provide a useful learning signal for multimodal models. To study this, we collect a dataset of paired audio and RGB-D recordings from a variety of quiet indoor scenes. We then train models that estimate the distance to nearby walls, given only audio as input. We also use these… 

Mix and Localize: Localizing Sound Sources in Mixtures

This work creates a graph in which images and separated sounds correspond to nodes, and trains a random walker to transition between nodes from different modalities with high return probability, determined by an audio-visual similarity metric that is learned by the model.

Learning Visual Styles from Audio-Visual Associations

This paper presents a method for learning visual styles from unlabeled audio-visual data that learns to manipulate the texture of a scene to match a sound, a problem the authors term audio-driven image stylization.

That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation

This work proposes a data-centric approach to dynamic manipulation that uses an often-ignored source of information, sound. Its results indicate that, when asked to generate desired sound behavior, online rollouts of the models on a UR10 robot produce dynamic behavior that achieves an average 11.5% improvement over supervised learning on audio similarity metrics.

Finding Fallen Objects Via Asynchronous Audio-Visual Integration

  • Chuang Gan, Yi Gu, A. Torralba
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
A set of embodied agent baselines is developed, based on imitation learning, reinforcement learning, and modular planning, and an in-depth analysis of the challenge of this new task is performed.

Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning

A system that can complete a set of challenging, partially-observed tasks on a Franka Emika Panda robot, like extracting keys from a bag, with a 70% success rate, 50% higher than a policy that does not use audio.

Camera Pose Estimation and Localization with Active Audio Sensing

This work shows how to estimate a device’s position and orientation indoors by echolocation, i.e., by interpreting the echoes of an audio signal that the device itself emits, and proposes a strategy for learning an audio representation that captures the scene geometry around a device using supervision transfer from vision.



Ambient Sound Provides Supervision for Visual Learning

This work trains a convolutional neural network to predict a statistical summary of the sound associated with a video frame, and shows that this representation is comparable to that of other state-of-the-art unsupervised learning methods.

The Sound of Motions

Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, the proposed motion-based system improves performance in separating musical instrument sounds.

SoundNet: Learning Sound Representations from Unlabeled Video

This work proposes a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual recognition models into the sound modality, using unlabeled video as a bridge, and suggests that some high-level semantics automatically emerge in the sound network, even though it is trained without ground-truth labels.

Learning to Separate Object Sounds by Watching Unlabeled Video

This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.

The Sound of Pixels

Qualitative results suggest the PixelPlayer model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources, and experimental results show that the proposed Mix-and-Separate framework outperforms several baselines on source separation.

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.

Multiple Sound Sources Localization from Coarse to Fine

A two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, and then performs cross-modal feature alignment in a coarse-to-fine manner, achieves state-of-the-art localization results on a public dataset, as well as considerable performance on multi-source sound localization in complex scenes.

Learning to Set Waypoints for Audio-Visual Navigation

This work introduces a reinforcement learning approach to audio-visual navigation with two key novel elements: waypoints that are dynamically set and learned end-to-end within the navigation policy, and an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves.

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound

This work proposes a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream, and demonstrates that understanding spatial correspondence enables models to perform better on three audio-visual tasks.

Audio-Visual Floorplan Reconstruction

This work introduces AV-Map, a novel multi-modal encoder-decoder framework that reasons jointly about audio and vision to reconstruct a floorplan from a short input video sequence, and that is trained to predict both the interior structure of the environment and the associated rooms’ semantic labels.