Multiple Sound Sources Localization from Coarse to Fine

@article{Qian2020MultipleSS,
  title={Multiple Sound Sources Localization from Coarse to Fine},
  author={Rui Qian and Di Hu and Heinrich Dinkel and Mengyue Wu and Ning Xu and Weiyao Lin},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.06355}
}
How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially when pairwise sound-object annotations are unavailable. To solve this problem, we develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, then performs cross-modal feature alignment in a coarse-to-fine manner. Our model achieves state-of-the-art results on the public localization dataset, as well as…
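As a rough illustration of what coarse cross-modal alignment can look like (a minimal sketch under assumed tensor shapes, not the authors' released code), the clip-level score below is simply the maximum of an audio-visual cosine-similarity map:

```python
# Minimal sketch of coarse alignment between an audio clip and a frame;
# tensor shapes and names are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def coarse_alignment(audio_emb, visual_feat):
    """audio_emb: (B, C); visual_feat: (B, C, H, W)."""
    a = F.normalize(audio_emb, dim=1)                 # unit-norm audio embedding
    v = F.normalize(visual_feat, dim=1)               # unit-norm per-location visual features
    sim_map = torch.einsum('bc,bchw->bhw', a, v)      # cosine similarity at every location
    score = sim_map.flatten(1).max(dim=1).values      # coarse clip-level alignment score
    return sim_map, score

sim_map, score = coarse_alignment(torch.randn(4, 512), torch.randn(4, 512, 14, 14))
print(sim_map.shape, score.shape)  # torch.Size([4, 14, 14]) torch.Size([4])
```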
Dual Normalization Multitasking for Audio-Visual Sounding Object Localization
TLDR
This work proposes Dual Normalization Multitasking (DNM), a multitask training strategy and architecture that aggregates the Audio-Visual Correspondence (AVC) task and video-event classification into a single audiovisual similarity map.
Unsupervised Sound Localization via Iterative Contrastive Learning
TLDR
An iterative contrastive learning framework for sound localization that requires no data annotations, gradually encouraging the localization of sounding objects while reducing the correlation between non-sounding regions and the reference audio.
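A hedged sketch of the iterative idea, assuming the previous iteration's localization map is reused as a soft pseudo-mask that pools positive visual features for an InfoNCE-style loss; the names and pooling choice are illustrative, not the paper's exact formulation:

```python
# One iteration of contrastive sound localization with a pseudo-mask
# carried over from the previous iteration (illustrative sketch).
import torch
import torch.nn.functional as F

def contrastive_step(audio_emb, visual_feat, prev_map, tau=0.07):
    """audio_emb: (B, C); visual_feat: (B, C, H, W); prev_map: (B, H, W) in [0, 1]."""
    w = prev_map.flatten(1)
    w = w / (w.sum(dim=1, keepdim=True) + 1e-6)           # normalized pseudo-mask weights
    v = visual_feat.flatten(2)                            # (B, C, HW)
    pooled = torch.bmm(v, w.unsqueeze(2)).squeeze(2)      # weighted "sounding" visual feature
    a = F.normalize(audio_emb, dim=1)
    p = F.normalize(pooled, dim=1)
    logits = a @ p.t() / tau                              # (B, B) audio-to-visual similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)               # mismatched pairs act as negatives
```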
Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching
TLDR
Experimental results in both realistic and synthesized cocktail-party videos demonstrate that the proposed two-stage learning framework is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
Localizing Visual Sounds the Hard Way
TLDR
The key technical contribution is a mechanism that automatically mines hard samples and adds them to a contrastive learning formulation, showing that training the network to explicitly discriminate challenging image fragments, even within images that do contain the sounding object, significantly boosts localization performance.
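One plausible way to realize such in-image hard-sample mining is a tri-map over the similarity map: confident regions become positives, clearly dissimilar regions within the same image become hard negatives, and the band in between is ignored. The thresholds below are illustrative values, not the paper's settings:

```python
# Tri-map weighting over an audio-visual similarity map (illustrative sketch).
import torch

def trimap_weights(sim_map, pos_thr=0.65, neg_thr=0.4):
    """sim_map: (B, H, W) cosine similarities between the audio and each location."""
    pos = (sim_map >= pos_thr).float()   # confident sounding regions -> positives
    neg = (sim_map <= neg_thr).float()   # dissimilar regions in the same image -> hard negatives
    return pos, neg                       # locations in between contribute no loss
```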
Class-aware Sounding Objects Localization via Audiovisual Correspondence
  • Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen
  • Computer Science, Medicine
    IEEE transactions on pattern analysis and machine intelligence
  • 2021
TLDR
A two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision, which is superior in localizing and recognizing objects as well as filtering out silent ones.
Space-Time Memory Network for Sounding Object Localization in Videos
  • Sizhe Li, Yapeng Tian, Chenliang Xu
  • Computer Science
    ArXiv
  • 2021
TLDR
This work proposes a space-time memory network for sounding object localization in videos that simultaneously learns spatio-temporal attention over both unimodal and cross-modal representations from the audio and visual modalities, and demonstrates that it generalizes over various complex audio-visual scenes and outperforms recent state-of-the-art methods.
Localize to Binauralize: Audio Spatialization from Visual Sound Source Localization
Videos with binaural audios provide immersive viewing experience by enabling 3D sound sensation. Recent works attempt to generate binaural audio in a multimodal learning framework using large…
Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
TLDR
This work proposes a novel end-to-end deep learning approach that gives robust voice activity detection and localization results, localizing active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity.
Structure from Silence: Learning Scene Structure from Ambient Sound
TLDR
It is suggested that ambient sound conveys a surprising amount of information about scene structure, and that it is a useful signal for learning multimodal features.
Audio-Visual Localization by Synthetic Acoustic Image Generation
TLDR
This work proposes to leverage the generation of synthetic acoustic images from common audio-video data for the task of audio-visual localization, using a novel deep architecture trained to reconstruct the ground truth spatialized audio data collected by a microphone array from the associated video and its corresponding monaural audio signal.

References

SHOWING 1-10 OF 36 REFERENCES
Learning to Localize Sound Source in Visual Scenes
TLDR
A novel unsupervised algorithm to address the problem of localizing the sound source in visual scenes, together with a two-stream network structure that handles each modality and an attention mechanism for sound source localization.
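A minimal sketch of a two-stream localization head in this spirit, with assumed layer sizes and an attention map obtained as the cosine similarity between the audio embedding and each visual location:

```python
# Two-stream attention head for sound source localization (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.audio_fc = nn.Linear(128, dim)                     # projects an audio embedding
        self.visual_conv = nn.Conv2d(512, dim, kernel_size=1)   # projects a CNN feature map

    def forward(self, audio_emb, visual_feat):
        a = F.normalize(self.audio_fc(audio_emb), dim=1)        # (B, dim)
        v = F.normalize(self.visual_conv(visual_feat), dim=1)   # (B, dim, H, W)
        attn = torch.einsum('bc,bchw->bhw', a, v)               # localization / attention map
        weights = attn.flatten(1).softmax(dim=1)                # spatial attention weights
        pooled = torch.einsum('bn,bcn->bc', weights, v.flatten(2))  # attended visual vector
        return attn, pooled
```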
Learning to Separate Object Sounds by Watching Unlabeled Video
TLDR
This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.
Self-supervised Audio-visual Co-segmentation
TLDR
This paper develops a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision, and introduces a learning approach to disentangle concepts in the neural networks.
Co-Separating Sounds of Visual Objects
  • Ruohan Gao, K. Grauman
  • Computer Science, Engineering
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
TLDR
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
The Sound of Pixels
TLDR
Qualitative results suggest the PixelPlayer model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources, and experimental results show that the proposed Mix-and-Separate framework outperforms several baselines on source separation.
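The mix-and-separate recipe can be summarized in a few lines; the sketch below assumes magnitude spectrograms and uses a placeholder `mask_net` for the visually conditioned mask predictor:

```python
# Mix-and-separate self-supervision: mix two clips, recover each one
# conditioned on its own visual feature (illustrative sketch).
import torch
import torch.nn.functional as F

def mix_and_separate_loss(spec_a, spec_b, vis_a, vis_b, mask_net):
    """spec_*: (B, F, T) magnitude spectrograms; vis_*: (B, C) visual features."""
    mix = spec_a + spec_b                          # synthetic mixture
    est_a = mask_net(mix, vis_a).sigmoid() * mix   # mask conditioned on video A
    est_b = mask_net(mix, vis_b).sigmoid() * mix   # mask conditioned on video B
    return F.l1_loss(est_a, spec_a) + F.l1_loss(est_b, spec_b)
```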
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
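A hedged sketch of that alignment pretext task, where negatives are created by shifting the audio in time and `fusion_net` stands in for the fused multisensory network (both names are assumptions):

```python
# Temporal-alignment pretext task: classify whether frames and audio
# come from the same moment (illustrative sketch).
import torch
import torch.nn.functional as F

def alignment_loss(frames, audio, fusion_net):
    """frames: (B, C, T, H, W); audio: (B, L) waveform."""
    shifted = torch.roll(audio, shifts=audio.size(1) // 2, dims=1)  # misaligned audio
    logits_pos = fusion_net(frames, audio)      # should predict "aligned"
    logits_neg = fusion_net(frames, shifted)    # should predict "misaligned"
    logits = torch.cat([logits_pos, logits_neg])
    labels = torch.cat([torch.ones_like(logits_pos), torch.zeros_like(logits_neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```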
Curriculum Audiovisual Learning
TLDR
A flexible audiovisual model is presented that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as the latent supervision for inferring the correlation among detected contents.
2.5D Visual Sound
  • Ruohan Gao, K. Grauman
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
TLDR
A deep convolutional neural network is devised that learns to decode the monaural soundtrack into its binaural counterpart by injecting visual information about object and scene configurations, and the resulting output 2.5D visual sound helps "lift" the flat single channel audio into spatialized sound.
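A compact sketch of the mono-to-binaural recipe described above: since the mono track is the sum of the two channels, predicting the left/right difference is enough to recover both. `diff_net` is a placeholder for the visually conditioned decoder:

```python
# Mono-to-binaural conversion via a predicted difference signal (illustrative sketch).
import torch

def binauralize(mono_spec, visual_feat, diff_net):
    """mono_spec: (B, F, T) spectrogram of the mono mix; visual_feat: (B, C)."""
    diff = diff_net(mono_spec, visual_feat)   # predicted (left - right) spectrogram
    left = (mono_spec + diff) / 2             # left  = (mono + diff) / 2
    right = (mono_spec - diff) / 2            # right = (mono - diff) / 2
    return left, right
```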
Self-Supervised Moving Vehicle Tracking With Stereo Sound
TLDR
This work proposes a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time, and demonstrates that the proposed approach outperforms several baseline approaches.
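One common way to realize such audio-only localization is cross-modal distillation, sketched below with placeholder `vision_teacher` and `audio_student` networks; whether this matches the paper's exact training setup is an assumption:

```python
# Cross-modal distillation: a visual teacher pseudo-labels object coordinates
# on unlabeled clips, an audio student regresses them from stereo sound alone.
import torch
import torch.nn.functional as F

def distill_step(frames, stereo_spec, vision_teacher, audio_student):
    with torch.no_grad():
        target_box = vision_teacher(frames)     # pseudo-label from the visual stream
    pred_box = audio_student(stereo_spec)       # prediction from stereo audio only
    return F.smooth_l1_loss(pred_box, target_box)
```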
The Sound of Motions
TLDR
Quantitative and qualitative evaluations show that comparing to previous models that rely on visual appearance cues, the proposed novel motion based system improves performance in separating musical instrument sounds.