Self-supervised object detection from audio-visual correspondence

Triantafyllos Afouras, Yuki M. Asano, Francois Fagan, Andrea Vedaldi, Florian Metze. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
We tackle the problem of learning object detectors without supervision. Differently from weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to “teach” the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the… 
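The audio-visual correspondence signal described in this abstract is commonly implemented as a batch-wise contrastive (InfoNCE-style) objective that pulls each clip's visual embedding toward its own audio embedding and away from other clips'. The sketch below is illustrative only; the function name, temperature value, and NumPy formulation are assumptions, not the paper's actual implementation.

```python
import numpy as np

def avc_contrastive_loss(vis_emb, aud_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of (B, D) visual and audio embeddings.
    Positives are same-clip pairs (the diagonal); negatives are all other
    clips in the batch. Illustrative stand-in, not the paper's code."""
    # L2-normalise both embedding sets so the dot product is cosine similarity
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    a = aud_emb / np.linalg.norm(aud_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature                    # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Positives sit on the diagonal; minimise their negative log-probability
    return float(-np.log(np.diag(probs)).mean())
```

Correctly paired batches should score a much lower loss than misaligned ones, which is the property the detector's training signal relies on.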

Complementary Cues from Audio Help Combat Noise in Weakly-Supervised Object Detection

The proposed methods update noisy ground truth and provide indirect and attention paths, greatly boosting performance on the AudioSet and VGGSound datasets over single-modality predictions, even ones that use contrastive learning.

Self-Supervised Video Forensics by Audio-Visual Anomaly Detection

An autoregressive model is trained to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound, and obtains strong performance on the task of detecting manipulated speech videos.
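Anomaly detection with an autoregressive sequence model, as summarized above, typically scores each time step by how poorly it is predicted from its past. The toy sketch below uses a plain least-squares autoregressive fit over a feature sequence as a stand-in for the paper's learned model; all names and the AR order are illustrative assumptions.

```python
import numpy as np

def anomaly_scores(features, order=2):
    """Score each step of a (T, D) feature sequence by the prediction error
    of a simple linear autoregressive model fit on the sequence itself.
    A toy stand-in for a learned autoregressive density model."""
    T, D = features.shape
    # Lagged design matrix: predict x_t from the previous `order` steps
    X = np.hstack([features[i:T - order + i] for i in range(order)])
    y = features[order:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    scores = np.linalg.norm(residual, axis=1)   # high score = poorly predicted
    # The first `order` steps have no full history, so score them zero
    return np.concatenate([np.zeros(order), scores])
```

Frames whose audio-visual features break the learned temporal regularity (e.g. manipulated speech segments) surface as peaks in this score.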

Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research

“Tragic Talkers” is presented, an audio-visual dataset consisting of excerpts from the “Romeo and Juliet” drama captured with microphone arrays and multiple co-located cameras for light-field video, designed to cover various conventional talking scenarios.

Egocentric Audio-Visual Noise Suppression

A multi-task learning framework that jointly optimizes audio-visual noise suppression and video based acoustic event detection is introduced that outperforms the audio only baseline on all metrics, including a 0.16 PESQ improvement.

Saliency Can Be All You Need In Contrastive Self-Supervised Learning

An augmentation policy for Contrastive Self-Supervised Learning is proposed in the form of an already established Salient Image Segmentation technique entitled Global Contrast based Salient Region Detection, and the results indicate that the proposed technique indeed contributes to SSL.

PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection

This work proposes a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract most human attention in 360° panoramic videos reflecting real-life daily scenes, and proposes a new baseline network, which takes advantage of both visual and audio cues of 360° video frames by using a new conditional variational auto-encoder (CVAE).

Temporal and cross-modal attention for audio-visual zero-shot learning

This work proposes a multi-modal Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning and shows that the proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning.

Self-Supervised Learning of Music-Dance Representation through Explicit-Implicit Rhythm Synchronization

MuDaR, a novel Music-Dance Representation learning framework, is introduced to synchronize music and dance rhythms in both explicit and implicit ways, and it outperforms other self-supervised methods by a large margin.

Look, Radiate, and Learn: Self-supervised Localisation via Radio-Visual Correspondence

The results indicate that accurate radio target localisation can be automatically learned from paired radio-visual data without labels, which opens the door for vast data scalability and may prove key to realising the promise of robust radio sensing atop a unified communication-perception cellular infrastructure.

ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound

This work proposes to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process, and achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo and Charades.

PCL: Proposal Cluster Learning for Weakly Supervised Object Detection

This paper first shows that instances can be assigned object or background labels directly based on proposal clusters for instance classifier refinement, and then shows that treating each cluster as a small new bag yields fewer ambiguities than assigning labels directly.

Vggsound: A Large-Scale Audio-Visual Dataset

The goal is to collect a large-scale audio-visual dataset with low label noise from videos 'in the wild' using computer vision techniques; various Convolutional Neural Network architectures and aggregation approaches are investigated to establish audio recognition baselines for this new dataset.

Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

Labelling unlabelled videos from scratch with multi-modal self-supervision

It is shown that unsupervised labelling of a video dataset does not come for free from strong feature encoders; a novel clustering method is proposed that allows pseudo-labelling of the video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities.
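Multi-modal pseudo-labelling of the kind summarized above can be illustrated with a minimal clustering pass over concatenated audio and visual features. The sketch below uses plain k-means with a deterministic farthest-point initialisation as a toy stand-in for the paper's clustering objective; every name and parameter here is an assumption for illustration.

```python
import numpy as np

def pseudo_label(vis_feats, aud_feats, k=2, iters=20):
    """Cluster concatenated (N, Dv) visual and (N, Da) audio features with
    plain k-means to produce pseudo-labels without human annotation.
    Toy stand-in, not the paper's clustering method."""
    X = np.hstack([vis_feats, aud_feats]).astype(float)
    # Deterministic farthest-point init: first sample, then the sample
    # farthest from all centers chosen so far
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min(np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2), axis=1)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):               # keep empty clusters fixed
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

The resulting cluster assignments can then serve as classification targets for training an encoder from scratch.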

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Experimental results in both realistic and synthesized cocktail-party videos demonstrate that the proposed two-stage learning framework is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.

Objects that Sound

New network architectures are designed that can be trained using the audio-visual correspondence (AVC) task for two functionalities: cross-modal retrieval, and localizing the source of a sound in an image.
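Once audio and images share an embedding space, cross-modal retrieval of the kind this abstract describes reduces to a nearest-neighbour search. The function below is a minimal illustrative sketch under that assumption; it is not the paper's architecture.

```python
import numpy as np

def retrieve(query_aud, vis_bank, topk=3):
    """Rank items in a (N, D) visual embedding bank by cosine similarity
    to a (D,) audio query embedding; return the top-k indices.
    Illustrative sketch, assuming embeddings already share a space."""
    q = query_aud / np.linalg.norm(query_aud)
    V = vis_bank / np.linalg.norm(vis_bank, axis=1, keepdims=True)
    sims = V @ q                       # cosine similarity to the query
    return np.argsort(-sims)[:topk]    # indices, most similar first
```

The same similarity map, computed spatially over visual feature columns instead of whole images, is the usual route to localizing the sounding region.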

Self-labelling via simultaneous clustering and representation learning

The proposed novel and principled learning formulation is able to self-label visual data so as to train highly competitive image representations without manual labels and yields the first self-supervised AlexNet that outperforms the supervised Pascal VOC detection baseline.

Foreground Activation Maps for Weakly Supervised Object Localization

This work proposes foreground activation maps (FAM), whose aim is to optimize object localization and classification jointly via an object-aware attention module and a part-aware attention module in a unified model, where the two tasks can complement and enhance each other.

Online Refinement of Low-level Feature Based Activation Map for Weakly Supervised Object Localization

A weighted entropy loss, an attentive erasing, and an area loss are proposed to drive the activation map generator to substantially reduce the uncertainty of activations between object and background, and explore less discriminative regions.
