Move2Hear: Active Audio-Visual Source Separation

Sagnik Majumder, Ziad Al-Halah, Kristen Grauman. 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and it must use its eyes and ears to automatically separate out the sounds originating from a target object within a limited time budget. Towards this goal, we introduce a reinforcement…

Catch Me if You Hear Me: Audio-Visual Navigation in Complex Unmapped Environments With Moving Sounds

This work proposes an architecture that fuses audio-visual information in the spatial feature space to learn correlations of geometric information inherent in both local maps and audio signals, and demonstrates that this approach consistently outperforms the current state-of-the-art across all tasks of moving sounds, unheard sounds, and noisy environments.

Few-Shot Audio-Visual Learning of Environment Acoustics

A transformer-based method is introduced that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention; it is demonstrated that this method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner.

SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

This work introduces SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments with the advantages of allowing continuous spatial sampling, generalization to novel environments, and configurable microphone and material properties, and it showcases the simulator's properties and benchmarks its performance against real-world audio measurements.

iQuery: Instruments as Queries for Audio-Visual Sound Separation

"Visually named" queries are used to initiate the learning of audio queries, cross-modal attention removes potential sound-source interference at the estimated waveforms, and an additional query is inserted as an audio prompt while the attention mechanism is kept frozen.

Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

An audio-visual deep reinforcement learning approach that works with the authors' shared scene mapper to selectively turn on the camera to efficiently chart out the space, and achieves an excellent cost-accuracy tradeoff.

Mix and Localize: Localizing Sound Sources in Mixtures

This work creates a graph in which images and separated sounds correspond to nodes, and trains a random walker to transition between nodes from different modalities with high return probability, determined by an audio-visual similarity metric that is learned by the model.
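The cycle-consistency idea behind this random-walk formulation can be illustrated with a toy sketch. The walker steps from image nodes to sound nodes and back using transition probabilities derived from a similarity matrix; training would push the round-trip (return) probabilities toward the identity. The similarity values and matrix sizes below are made up for illustration and are not the paper's learned metric.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of similarity scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def round_trip(sim):
    """Image -> sound -> image round-trip probabilities.

    sim[i][j] is an audio-visual similarity between image node i and
    separated-sound node j (toy values here, not learned features).
    """
    p_v_to_a = [softmax(row) for row in sim]        # walk to sound nodes
    p_a_to_v = [softmax(list(col)) for col in zip(*sim)]  # walk back
    return matmul(p_v_to_a, p_a_to_v)

# Toy similarities: each image matches "its" sound most strongly.
sim = [[4.0, 0.5, 0.2],
       [0.3, 3.5, 0.4],
       [0.1, 0.6, 4.2]]
R = round_trip(sim)
# Training would maximize the diagonal of R (high return probability),
# which in turn localizes each separated sound to its image region.
```

Because both transition matrices are row-stochastic, each row of the round-trip matrix sums to one, and a well-learned similarity concentrates that mass on the diagonal.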

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

This survey reviews current audio-visual learning from several aspects, offers an outlook on the field, and aims to provide researchers with a better understanding of this area.

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

This work proposes SepFusion, a novel framework that can smoothly produce optimal fusion structures for visual-sound separation and provides a series of strong models for broader applications, such as further promoting performance via model assembly, or providing suitable architectures for the separation of certain instrument classes.

DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing

This paper proposes a novel AVVP framework termed Dual Hierarchical Hybrid Network (DHHN), a hierarchical context modeling network that extracts different semantics over multiple temporal lengths and maintains the best adaptations to different modalities, further boosting video parsing performance.

PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection

This work proposes a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects grasping most of the human attention in 360° panoramic videos reflecting real-life daily scenes, and proposes a new baseline network, which takes advantage of both visual and audio cues of 360° video frames by using a new conditional variational auto-encoder (CVAE).

SoundSpaces: Audio-Visual Navigation in 3D Environments

This work proposes a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to discover elements of the geometry of the physical space indicated by the reverberating audio and detect and follow sound-emitting targets.

Learning to Set Waypoints for Audio-Visual Navigation

This work introduces a reinforcement learning approach to audio-visual navigation with two key novel elements: waypoints that are dynamically set and learned end-to-end within the navigation policy, and an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves.

Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

This paper approaches the problem of Audio-Visual Embodied Navigation: planning the shortest path from a random starting location in a scene to the sound source in an indoor environment, given only raw egocentric visual and audio sensory data.

Semantic Audio-Visual Navigation

This work proposes a transformer-based model, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target, and strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
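The self-supervised pretext task described here, predicting whether frames and audio are temporally aligned, comes down to how training pairs are constructed. The sketch below shows one plausible pairing scheme: positives pair each frame with its own audio, negatives pair it with audio shifted by a fixed offset. The names and the shift value are illustrative assumptions, not the paper's exact settings.

```python
import random

def make_pairs(frames, audio, shift=5):
    """Build (video, audio, label) pairs for the alignment pretext task.

    Label 1: audio is temporally aligned with the frame.
    Label 0: audio track is shifted by `shift` steps (misaligned).
    A network trained to predict the label learns a fused
    multisensory representation without any manual annotation.
    """
    n = len(frames)
    pairs = []
    for t in range(n - shift):
        pairs.append((frames[t], audio[t], 1))          # aligned
        pairs.append((frames[t], audio[t + shift], 0))  # misaligned
    random.shuffle(pairs)
    return pairs

# Toy stand-ins for frame and audio features from one video clip.
frames = [f"frame_{t}" for t in range(20)]
audio = [f"audio_{t}" for t in range(20)]
pairs = make_pairs(frames, audio)
```

Since positives and negatives come from the same clip, the classifier cannot cheat on content and must rely on fine-grained temporal correspondence between the two streams.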

Vision-guided robot hearing

This paper introduces a hybrid deterministic/probabilistic model that enables the visual features to guide the grouping of the auditory features in order to form audiovisual (AV) objects, and it performs experiments to investigate how vision and hearing could be further combined for robust human-robot interaction (HRI).

Looking to listen at the cocktail party

A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds, such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.

OtoWorld: Towards Learning to Separate by Learning to Move

OtoWorld is an interactive environment in which agents must learn to listen in order to solve navigational tasks, and preliminary results on the ability of agents to win at OtoWorld are presented.

An information based feedback control for audio-motor binaural localization

This paper determines an admissible motion of a binaural head which leads, on average, to the one-step-ahead most informative audio-motor localization.
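The "one-step-ahead most informative" criterion can be sketched as greedy entropy minimization: for each admissible head motion, predict the posterior over source directions and pick the motion whose predicted posterior is sharpest. The candidate motions and probability values below are toy numbers for illustration, not the paper's model.

```python
import math

def entropy(p):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(x * math.log2(x) for x in p if x > 0)

# Toy predicted posteriors over 4 source directions, one per
# hypothetical head motion (illustrative values only).
posteriors = {
    "turn_left":  [0.70, 0.10, 0.10, 0.10],
    "stay":       [0.40, 0.30, 0.20, 0.10],
    "turn_right": [0.25, 0.25, 0.25, 0.25],
}

# Choose the motion expected to yield the most informative
# (lowest-entropy) localization after one step.
best = min(posteriors, key=lambda a: entropy(posteriors[a]))
```

In a full audio-motor loop this selection would be repeated after each observation, with the posteriors updated by a Bayesian filter rather than fixed as they are here.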

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.