Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

@article{Hu2020DiscriminativeSO,
  title={Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching},
  author={Di Hu and Rui Qian and Minyue Jiang and Xiao Tan and Shilei Wen and Errui Ding and Weiyao Lin and Dejing Dou},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.05466}
}
Discriminatively localizing sounding objects in cocktail-party scenarios, i.e., mixed sound scenes, is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we propose to learn robust object representations by aggregating candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in the cocktail-party…
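The first stage described in the abstract amounts to pooling visual features under each single-source clip's localization map and averaging per (pseudo) class to build a dictionary of object representations. Below is a minimal PyTorch sketch of that aggregation step; the function name, shapes, and the use of masked average pooling are assumptions, not the paper's exact implementation.

```python
import torch

def aggregate_object_dictionary(feats, loc_maps, labels, num_classes):
    """Stage-1 sketch: pool features under localization maps, average per class.

    feats:    (N, C, H, W) visual features of single-source clips
    loc_maps: (N, H, W)    candidate localization maps in [0, 1]
    labels:   (N,)         pseudo-class index of each clip
    """
    weights = loc_maps.flatten(1).sum(1, keepdim=True) + 1e-6      # (N, 1)
    pooled = torch.einsum('nchw,nhw->nc', feats, loc_maps) / weights
    dictionary = torch.zeros(num_classes, feats.size(1))
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            dictionary[k] = pooled[mask].mean(0)                   # class average
    return dictionary  # (num_classes, C): one representation per object class
```

In the second stage, comparing each dictionary entry against the feature map of a cocktail-party frame (e.g., by cosine similarity at every spatial location) would yield the class-aware localization maps the abstract describes.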
Class-aware Sounding Objects Localization via Audiovisual Correspondence
  • Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen
  • Computer Science, Medicine
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2021
TLDR
A two-stage learning framework that localizes and recognizes sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision; it is superior at localizing and recognizing objects as well as at filtering out silent ones.
Dual Normalization Multitasking for Audio-Visual Sounding Object Localization
TLDR
A novel multitask training strategy and architecture, Dual Normalization Multitasking (DNM), that aggregates the Audio-Visual Correspondence (AVC) task and the video-event classification task into a single audiovisual similarity map.
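The TLDR does not spell out the mechanism, but a single similarity map serving both the AVC and classification tasks suggests normalizing the same scores along two different axes. The sketch below is speculative; the shapes, names, and choice of softmax axes are assumptions.

```python
import torch

def dual_normalized_scores(sim_map):
    """Hypothetical dual normalization of one audiovisual similarity map.

    sim_map: (K, H, W) similarity between K event classes and H*W locations
    """
    K, H, W = sim_map.shape
    flat = sim_map.reshape(K, H * W)
    # Spatial softmax: where does each event sound from? (AVC-style view)
    spatial = torch.softmax(flat, dim=1).reshape(K, H, W)
    # Class softmax: which event is present at each location? (classification view)
    classwise = torch.softmax(flat, dim=0).reshape(K, H, W)
    return spatial, classwise
```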
Unsupervised Sound Localization via Iterative Contrastive Learning
TLDR
An iterative contrastive learning framework for sound localization that requires no data annotations; it gradually encourages localization of the sounding objects while reducing the correlation between non-sounding regions and the reference audio.
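A standard way to realize such an objective is an InfoNCE-style loss that pulls the audio embedding toward the current sounding-region estimate and pushes it away from non-sounding regions, with the region estimates refined over iterations. A minimal sketch, assuming pooled region features are already available (not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb, pos_regions, neg_regions, tau=0.07):
    """audio_emb: (B, D); pos_regions: (B, D) current sounding-region features;
    neg_regions: (B, N, D) features of non-sounding regions."""
    a = F.normalize(audio_emb, dim=-1)
    p = F.normalize(pos_regions, dim=-1)
    n = F.normalize(neg_regions, dim=-1)
    pos = (a * p).sum(-1, keepdim=True) / tau              # (B, 1)
    neg = torch.einsum('bd,bnd->bn', a, n) / tau           # (B, N)
    logits = torch.cat([pos, neg], dim=1)
    # index 0 is the positive: audio should correlate with sounding regions only
    return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))
```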
Self-supervised object detection from audio-visual correspondence
TLDR
This work extracts a supervisory signal from audio-visual data, using the audio component to “teach” the object detector, and outperforms previous unsupervised and weakly-supervised detectors on object detection and sound source localization.
Space-Time Memory Network for Sounding Object Localization in Videos
  • Sizhe Li, Yapeng Tian, Chenliang Xu
  • Computer Science
    ArXiv
  • 2021
TLDR
This work proposes a space-time memory network for sounding object localization in videos that simultaneously learns spatio-temporal attention over both unimodal and cross-modal representations from the audio and visual modalities; it is shown to generalize over various complex audio-visual scenes and to outperform recent state-of-the-art methods.
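The building block behind such a design is cross-modal attention: features of one modality query a memory of the other. A generic scaled dot-product sketch (the shapes and modality roles are assumptions; the paper's memory module is more elaborate):

```python
import torch

def cross_modal_attention(query, memory):
    """query:  (B, Tq, D), e.g. audio tokens of the current clip;
    memory: (B, Tm, D), e.g. visual tokens pooled over space and time."""
    scale = query.size(-1) ** 0.5
    attn = torch.softmax(query @ memory.transpose(1, 2) / scale, dim=-1)
    return attn @ memory  # (B, Tq, D): audio-conditioned visual readout
```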
Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video
TLDR
This work develops a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream's coherence with the positions of the sound source(s), and the consistency of the sounding objects' geometry over time.
Audio-Visual Localization by Synthetic Acoustic Image Generation
TLDR
This work proposes to leverage the generation of synthetic acoustic images from common audio-video data for the task of audio-visual localization, using a novel deep architecture trained to reconstruct the ground truth spatialized audio data collected by a microphone array from the associated video and its corresponding monaural audio signal.
Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos
TLDR
This paper proposes a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos by leveraging visual, audio, and face information, and shows that the proposed method outperforms 12 state-of-the-art saliency prediction methods and achieves competitive results in sound source localization.
Visually Informed Binaural Audio Generation without Binaural Audios
TLDR
This work proposes PseudoBinaural, an effective pipeline that is free of binaural recordings; it shows great stability in cross-dataset evaluation and achieves comparable performance in subjective preference tests.
Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
  • Yidi Li, Hong Liu, Hao Tang
  • Computer Science
    ArXiv
  • 2021
TLDR
A novel Multi-modal Perception Tracker for speaker tracking using both audio and visual modalities; it demonstrates robustness under adverse conditions and outperforms current state-of-the-art methods.

References

Showing 1-10 of 34 references
Learning to Localize Sound Source in Visual Scenes
TLDR
A novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes; a two-stream network structure that handles each modality, with an attention mechanism, is developed for sound source localization.
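The attention mechanism in such two-stream models is typically a normalized inner product between the audio embedding and every spatial position of the visual feature map. A minimal sketch, with names and shapes assumed:

```python
import torch
import torch.nn.functional as F

def sound_localization_attention(visual_feat, audio_emb):
    """visual_feat: (B, C, H, W); audio_emb: (B, C).
    Returns a (B, H, W) attention map highlighting the likely source."""
    B, C, H, W = visual_feat.shape
    v = F.normalize(visual_feat.reshape(B, C, H * W), dim=1)
    a = F.normalize(audio_emb, dim=1).unsqueeze(1)        # (B, 1, C)
    attn = torch.softmax((a @ v).squeeze(1), dim=-1)      # (B, H*W)
    return attn.reshape(B, H, W)
```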
Co-Separating Sounds of Visual Objects
  • Ruohan Gao, K. Grauman
  • Computer Science, Engineering
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
TLDR
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
Multiple Sound Sources Localization from Coarse to Fine
TLDR
A two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, then performs cross-modal feature alignment in a coarse-to-fine manner; it achieves state-of-the-art results on a public localization dataset, as well as considerable performance on multi-source sound localization in complex scenes.
Self-Supervised Moving Vehicle Tracking With Stereo Sound
TLDR
This work proposes a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time, and demonstrates that the proposed approach outperforms several baseline approaches.
Self-taught object localization with deep networks
This paper introduces self-taught object localization, a novel approach that leverages deep convolutional networks trained for whole-image recognition to localize objects in images without additional human supervision.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
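The alignment objective described here can be implemented as binary classification with temporally shifted audio as negatives. A hedged sketch, where `model` stands in for any fused audiovisual network returning one logit per clip (its signature is an assumption):

```python
import torch
import torch.nn.functional as F

def temporal_alignment_loss(model, frames, audio, shift_secs=2.0, sr=16000):
    """Positives: frames with their own audio; negatives: the same audio
    rolled in time, so the network must detect misalignment."""
    shifted = torch.roll(audio, shifts=int(shift_secs * sr), dims=-1)
    logits = torch.cat([model(frames, audio), model(frames, shifted)])
    labels = torch.cat([torch.ones(len(frames), 1),
                        torch.zeros(len(frames), 1)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```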
Curriculum Audiovisual Learning
TLDR
A flexible audiovisual model is presented that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as the latent supervision for inferring the correlation among detected contents.
Deep Multimodal Clustering for Unsupervised Audiovisual Learning
  • Di Hu, F. Nie, Xuelong Li
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
TLDR
A novel unsupervised audiovisual learning model named Deep Multimodal Clustering (DMC) that synchronously performs sets of clustering with multimodal vectors of convolutional maps in different shared spaces for capturing multiple audiovisual correspondences, and can be effectively trained with a max-margin loss in an end-to-end fashion.
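A max-margin objective of this kind can be sketched as a ranking loss: matched audiovisual cluster centers should score higher than mismatched ones by a margin. The shapes and the negative-sampling scheme below are assumptions, not DMC's exact formulation:

```python
import torch
import torch.nn.functional as F

def max_margin_loss(audio_centers, visual_centers, margin=1.0):
    """audio_centers, visual_centers: (B, K, D), K cluster centers per clip."""
    a = F.normalize(audio_centers, dim=-1)
    v = F.normalize(visual_centers, dim=-1)
    pos = (a * v).sum(-1)                    # (B, K) matched-pair scores
    neg = (a * v.roll(1, dims=0)).sum(-1)    # centers from another clip
    return F.relu(margin - pos + neg).mean()
```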
Objects that Sound
TLDR
New network architectures are designed that can be trained using the AVC task for two functionalities: cross-modal retrieval and localizing the source of a sound in an image.
Look, Listen and Learn
TLDR
There is a valuable, but so far untapped, source of information contained in video itself, namely the correspondence between the visual and audio streams; a novel “Audio-Visual Correspondence” learning task makes use of it.
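The AVC task itself is simple to state: given a frame and a short audio clip, predict whether they come from the same video, with negatives drawn from other videos. A minimal sketch, where `head` is an assumed scoring network over embedding pairs:

```python
import torch
import torch.nn.functional as F

def avc_loss(frame_emb, audio_emb, head):
    """frame_emb, audio_emb: (B, D) embeddings from the two streams."""
    neg_audio = audio_emb.roll(1, dims=0)    # audio from a different video
    logits = torch.cat([head(frame_emb, audio_emb),
                        head(frame_emb, neg_audio)])
    labels = torch.cat([torch.ones(len(frame_emb), 1),
                        torch.zeros(len(frame_emb), 1)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```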