SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul T. Calamia, Dhruv Batra, Philip Robinson, Kristen Grauman
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, mapping, source localization and separation, and acoustic matching. Compared to existing resources… 
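Geometry-based audio rendering of the kind described above ultimately yields a room impulse response (RIR) for each source/microphone pair, and auralizing an arbitrary sound then reduces to convolving the dry signal with that RIR. A minimal sketch of this last step (this is not the SoundSpaces API; the exponential-decay noise RIR below is a toy stand-in for a simulated one):

```python
import numpy as np
from scipy.signal import fftconvolve

def auralize(dry: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry (anechoic) signal with a room impulse response
    and peak-normalize the result to avoid clipping."""
    wet = fftconvolve(dry, rir, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Toy stand-ins: a short click train as the dry source, and a synthetic
# RIR (exponentially decaying noise); a real RIR would come from the
# acoustic simulator for a given source/microphone placement.
sr = 16000
dry = np.zeros(sr)
dry[::4000] = 1.0
t = np.arange(int(0.3 * sr)) / sr
rng = np.random.default_rng(0)
rir = rng.standard_normal(t.size) * np.exp(-t / 0.07)

wet = auralize(dry, rir)  # reverberant version of the dry signal
```

Moving the microphone or source in the simulator changes only the RIR, so the same dry recording can be re-rendered at arbitrary positions without re-recording.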

Related Papers


SoundSpaces: Audio-Visual Navigation in 3D Environments
This work proposes a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to discover elements of the geometry of the physical space from the reverberating audio and to detect and follow sound-emitting targets.
Learning to Set Waypoints for Audio-Visual Navigation
This work introduces a reinforcement learning approach to audio-visual navigation with two key novel elements: waypoints that are dynamically set and learned end-to-end within the navigation policy, and an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves.
Visual Acoustic Matching
This work proposes a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output, and demonstrates that this approach successfully translates human speech to a variety of real-world environments depicted in images.
Semantic Audio-Visual Navigation
This work proposes a transformer-based model, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target, and strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
Move2Hear: Active Audio-Visual Source Separation
This work introduces the active audio-visual source separation problem and a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time, guided by the improvement in predicted audio separation quality.
GWA: A Large High-Quality Acoustic Dataset for Audio Processing
The Geometric-Wave Acoustic (GWA) dataset is presented: a large-scale audio dataset of about 2 million synthetic room impulse responses (IRs) with their corresponding detailed geometric and simulation configurations, and the first dataset with accurate wave-acoustic simulations in complex scenes.
Learning Neural Acoustic Fields
Our environment is filled with rich and dynamic acoustic information. When we walk into a cathedral, the reverberations as much as the appearance inform us of the sanctuary's wide open space.
Scene-Aware Audio Rendering via Deep Acoustic Analysis
A new method, based on deep neural networks, that captures the acoustic characteristics of real-world rooms using commodity devices and uses the captured characteristics to generate similar-sounding sources within virtual models.
Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis
This work uses an end-to-end neural network architecture to generate plausible audio impulse responses from single images of acoustic environments; convolving audio with these responses simulates the reverberant characteristics of the space shown in the image.
Looking to Listen at the Cocktail Party
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation on mixed speech.