Corpus ID: 232478710

Unsupervised Sound Localization via Iterative Contrastive Learning

@article{Lin2021UnsupervisedSL,
  title={Unsupervised Sound Localization via Iterative Contrastive Learning},
  author={Yan-Bo Lin and Hung-Yu Tseng and Hsin-Ying Lee and Yen-Yu Lin and Ming-Hsuan Yang},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.00315}
}
Sound localization aims to find the source of an audio signal in the visual scene. However, annotating the correlations between signals sampled from the audio and visual modalities is labor-intensive, which makes it difficult to supervise a model for this task. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method takes the 1) localization results in images predicted in the…
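
The iterative scheme described in the abstract maps naturally onto a short training loop: the previous iteration's localization maps are thresholded into pseudo-masks, which then define the positive regions of a contrastive loss. The Python sketch below is a hypothetical rendering of that idea under assumed shapes, not the authors' code; response_map, contrastive_loss, and the sigmoid threshold are all invented here for illustration.

# Minimal sketch of iterative contrastive learning for sound localization.
# All names and shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def response_map(vis_feat, aud_feat):
    """Cosine similarity between each visual location and the audio vector.

    vis_feat: (B, C, H, W) visual feature map; aud_feat: (B, C) audio embedding.
    Returns a (B, H, W) localization map with values in [-1, 1].
    """
    v = F.normalize(vis_feat, dim=1)
    a = F.normalize(aud_feat, dim=1)
    return torch.einsum('bchw,bc->bhw', v, a)

def contrastive_loss(vis_feat, aud_feat, prev_map, tau=0.07, thresh=0.5):
    """Pseudo-labels from the previous iteration's map drive the loss."""
    B = vis_feat.size(0)
    # Pseudo-mask: regions the previous model believed were sounding.
    mask = (prev_map.sigmoid() > thresh).float()           # (B, H, W)
    # Masked average-pool the visual features into one vector per clip.
    w = mask.unsqueeze(1)                                  # (B, 1, H, W)
    pooled = (vis_feat * w).sum(dim=(2, 3)) / w.sum(dim=(2, 3)).clamp(min=1e-6)
    v = F.normalize(pooled, dim=1)                         # (B, C)
    a = F.normalize(aud_feat, dim=1)                       # (B, C)
    logits = v @ a.t() / tau                               # (B, B) similarities
    targets = torch.arange(B, device=logits.device)        # matched pairs on diagonal
    # Symmetric InfoNCE: matched audio-visual pairs are positives,
    # all other pairings in the batch serve as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One outer iteration: re-estimate maps with the current model (detached),
# then minimize contrastive_loss over the dataset using those maps as pseudo-labels.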

Citations

Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes

Self-Supervised Predictive Learning, a negative-free method for sound source localization via explicit positive mining, is proposed, and a novel predictive coding module for audio-visual feature alignment is introduced, leading to semantically coherent similarities between audio and visual features.

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

The proposed framework learns strong multi-modal representations that benefit sound localisation and generalize to further applications, and it systematically investigates the effects of data augmentations to understand what enables the learning of useful representations.

Learning Sound Localization Better from Semantically Similar Samples

This work shows that hard positives can give response maps similar to those of the corresponding pairs in visual scenes, and incorporates these hard positives by adding their response maps directly into a contrastive learning objective.
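
A multi-positive variant of InfoNCE captures the idea of folding hard positives into the objective: the matched pair and the semantically similar sample both count toward the numerator. This is a loose, hypothetical sketch (nce_with_hard_positives and hard_idx are invented names), not the cited paper's implementation:

# Sketch: fold hard positives into an audio-visual contrastive objective.
import torch
import torch.nn.functional as F

def nce_with_hard_positives(v, a, hard_idx, tau=0.07):
    """v, a: (B, D) visual/audio embeddings.
    hard_idx: (B,) long tensor; index of a semantically similar sample per item."""
    v = F.normalize(v, dim=1)
    a = F.normalize(a, dim=1)
    sim = v @ a.t() / tau                        # (B, B) pairwise similarities
    B = v.size(0)
    diag = torch.arange(B, device=v.device)
    # Positives: the matched pair plus the hard positive's response.
    pos = torch.logsumexp(
        torch.stack([sim[diag, diag], sim[diag, hard_idx]], dim=1), dim=1)
    denom = torch.logsumexp(sim, dim=1)          # all pairings as candidates
    return (denom - pos).mean()                  # -log(sum_positives / sum_all)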

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

Quantitative and qualitative evaluations demonstrate that the Interference Eraser framework achieves superior results on sound localization tasks, especially under real-world scenarios.

Estimating Visual Information From Audio Through Manifold Learning

A new framework based on manifold learning extracts visual information about a scene using only audio signals, predicting two visual modalities from audio: depth and semantic segmentation.

Less Can Be More: Sound Source Localization With a Classification Model

The key contribution is to show that a simple audio-visual classification model can localize sound sources accurately and perform on par with state-of-the-art methods, proving that indeed "less is more".
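
One common way a plain classifier yields localization is through class activation maps, where per-location features are projected onto the classifier weight of the target class. The sketch below illustrates that mechanism under assumed shapes; AVClassifier and its tiny stand-in backbone are hypothetical, not the cited model:

# Sketch: localization from a plain classifier via class activation maps (CAM).
import torch
import torch.nn as nn

class AVClassifier(nn.Module):
    def __init__(self, num_classes=10, c=512):
        super().__init__()
        self.backbone = nn.Conv2d(3, c, 3, padding=1)   # stand-in visual encoder
        self.fc = nn.Linear(c, num_classes)             # class scores after pooling

    def forward(self, x):
        feat = self.backbone(x)                         # (B, C, H, W)
        logits = self.fc(feat.mean(dim=(2, 3)))         # global average pooling
        return logits, feat

    def cam(self, feat, cls):
        # Project per-location features onto the target class weight:
        # high response marks the regions driving the classification.
        w = self.fc.weight[cls]                         # cls: (B,) -> w: (B, C)
        return torch.einsum('bchw,bc->bhw', feat, w)    # (B, H, W) activation map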

Contrastive Learning of Global-Local Video Representations

This work proposes to learn video representations that generalize both to tasks requiring global semantic information and to tasks requiring local fine-grained spatio-temporal information, by optimizing two contrastive objectives that together encourage the model to learn global-local visual information given audio signals.
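
The global-local split can be read as two InfoNCE terms over one encoder: a clip-level term for semantics and a per-timestep term for fine-grained structure. The following is a minimal sketch under assumed shapes (global_local_loss and friends are invented names), not the cited paper's objective:

# Sketch: two contrastive objectives over one video encoder, global + local.
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.07):
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.t() / tau
    target = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, target)

def global_local_loss(vid_global, aud_global, vid_local, aud_local):
    """vid_global/aud_global: (B, D) clip-level embeddings.
    vid_local/aud_local: (B, T, D) per-timestep embeddings."""
    g = info_nce(vid_global, aud_global)           # clip <-> audio semantics
    B, T, D = vid_local.shape
    # Local term: contrast per-timestep audio and visual embeddings across
    # the batch, so the model must keep fine-grained temporal information.
    l = info_nce(vid_local.reshape(B * T, D), aud_local.reshape(B * T, D))
    return g + l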

Improving On-Screen Sound Separation for Open Domain Videos with Audio-Visual Self-attention

This work introduces a state-of-the-art audio-visual on-screen sound separation system capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos, using cross-modal and self-attention modules.

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by…

References

SHOWING 1-10 OF 51 REFERENCES

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

This work presents a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes, and extends the proposed algorithm to a new application: sound-saliency-based automatic camera-view panning in 360° videos.

Learning to Localize Sound Source in Visual Scenes

A novel unsupervised algorithm is proposed to address the problem of localizing the sound source in visual scenes; a two-stream network structure that handles each modality, with an attention mechanism, is developed for sound source localization.
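
The attention mechanism in such two-stream models is typically a normalized similarity between the audio embedding and each visual location. A minimal sketch, assuming (B, C, H, W) visual features and (B, C) audio embeddings; attention_localization is an invented name:

# Sketch: attention-style localization in a two-stream audio-visual network.
import torch
import torch.nn.functional as F

def attention_localization(vis_feat, aud_feat):
    """vis_feat: (B, C, H, W) from the visual stream,
    aud_feat: (B, C) from the audio stream.
    Returns (B, H, W) attention over locations, summing to 1 per image."""
    B, C, H, W = vis_feat.shape
    v = F.normalize(vis_feat.flatten(2), dim=1)       # (B, C, H*W)
    a = F.normalize(aud_feat, dim=1).unsqueeze(1)     # (B, 1, C)
    att = torch.bmm(a, v).squeeze(1)                  # (B, H*W) similarities
    return att.softmax(dim=-1).view(B, H, W)          # spatial attention map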

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Experimental results in both realistic and synthesized cocktail-party videos demonstrate that the proposed two-stage learning framework is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.

Deep Multimodal Clustering for Unsupervised Audiovisual Learning

  • Di Hu, F. Nie, Xuelong Li
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
A novel unsupervised audiovisual learning model, named Deep Multimodal Clustering (DMC), is proposed; it synchronously performs sets of clustering with multimodal vectors of convolutional maps in different shared spaces to capture multiple audiovisual correspondences, and can be effectively trained with a max-margin loss in an end-to-end fashion.

Self-Supervised Learning of Audio-Visual Objects from Video

This work introduces a model that uses attention to localize and group sound sources and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.

Robust Audio-Visual Instance Discrimination

The contributions are validated through extensive experiments on action recognition tasks, showing that they address the problems of audio-visual instance discrimination and improve transfer-learning performance.

Learning to Separate Object Sounds by Watching Unlabeled Video

This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.

Co-Separating Sounds of Visual Objects

  • Ruohan Gao, K. Grauman
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels, and that the proposed framework can effectively leverage unimodal and cross-modal temporal contexts and alleviate the modality bias and noisy label problems.

Learning Representations from Audio-Visual Spatial Alignment

A novel self-supervised pretext task for learning representations from audio-visual spatial alignment is proposed, using a transformer architecture to combine representations from multiple viewpoints; the ability to perform spatial alignment is enhanced by reasoning over the full spatial content of the 360° video.
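
Combining per-viewpoint features with a transformer can be sketched with a stock encoder plus a learned viewpoint embedding. All details below (ViewpointFusion, layer sizes, the number of viewpoints) are assumptions for illustration, not the cited architecture:

# Sketch: fusing per-viewpoint audio-visual features with a transformer,
# in the spirit of spatial-alignment pretext tasks.
import torch
import torch.nn as nn

class ViewpointFusion(nn.Module):
    def __init__(self, d=256, n_views=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Learned embedding marking which viewpoint each token came from.
        self.view_pos = nn.Parameter(torch.zeros(1, n_views, d))

    def forward(self, view_feats):
        """view_feats: (B, n_views, d), one feature per 360° viewpoint.
        Returns (B, n_views, d) viewpoint features contextualized jointly."""
        return self.encoder(view_feats + self.view_pos)
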
...