Learning to Separate Object Sounds by Watching Unlabeled Video

@article{Gao2018LearningTS,
  title={Learning to Separate Object Sounds by Watching Unlabeled Video},
  author={Ruohan Gao and Rog{\'e}rio Schmidt Feris and Kristen Grauman},
  journal={ArXiv},
  year={2018},
  volume={abs/1804.01665}
}
Perceiving a scene most fully requires all the senses. [...]

Key Method: We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to study audio source separation in large-scale general "in the wild" videos. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising.
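The key method builds on nonnegative matrix factorization (NMF): audio basis vectors are learned from unlabeled video and associated with visual object categories, and at test time the bases of the objects detected in a video guide the separation of its mixed audio. Below is a minimal NumPy sketch of that guidance step under simplifying assumptions; the per-object basis matrices (W_guitar, W_violin) are hypothetical stand-ins for bases learned as in the paper, not the authors' code.

# Minimal sketch: object-specific NMF bases guiding separation of a mixture spectrogram.
import numpy as np

def separate_with_bases(V, W_list, n_iter=200, eps=1e-9):
    """V: (freq, time) magnitude spectrogram of the mixture.
    W_list: nonnegative (freq, k_i) basis matrices, one per detected object."""
    W = np.concatenate(W_list, axis=1)              # stacked bases, kept fixed
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):                         # multiplicative updates for activations only
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)
    parts, start = [], 0
    for W_i in W_list:                              # partial reconstruction per object
        k = W_i.shape[1]
        parts.append(W_i @ H[start:start + k])
        start += k
    total = sum(parts) + eps
    return [V * (p / total) for p in parts]         # Wiener-style soft-masked estimates

# Usage with random stand-ins for the learned bases:
V = np.abs(np.random.rand(513, 100))                # mixture spectrogram
W_guitar = np.abs(np.random.rand(513, 25))          # hypothetical "guitar" bases
W_violin = np.abs(np.random.rand(513, 25))          # hypothetical "violin" bases
S_guitar, S_violin = separate_with_bases(V, [W_guitar, W_violin])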

Co-Separating Sounds of Visual Objects

  • Ruohan Gao, K. Grauman
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
TLDR
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.

Weakly-Supervised Audio-Visual Sound Source Detection and Separation

TLDR
An audio-visual co-segmentation approach in which the network learns what individual objects look and sound like from videos labeled only with object labels, outperforming state-of-the-art methods on visually guided sound source separation and sound denoising.

Weakly Supervised Representation Learning for Audio-Visual Scene Analysis

TLDR
This work develops methods that identify events and localize corresponding AV cues in unconstrained videos using weak labels, and demonstrates the framework's ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization.

Self-Supervised Learning of Audio-Visual Objects from Video

TLDR
This work introduces a model that uses attention to localize and group sound sources and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.

Visual Scene Graphs for Audio Source Separation

TLDR
An “in the wild” video dataset for sound source separation that contains multiple non-musical sources, adapted from the AudioCaps dataset and providing a challenging, natural, daily-life setting for source separation.

Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

TLDR
An audio spatialization framework is proposed to convert a monaural video into a binaural one by exploiting the relationship across the audio and visual components; the framework can be viewed as a self-supervised learning technique and alleviates the dependency on large amounts of video data with ground-truth binaural audio during training.

Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

TLDR
This work develops a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream’s coherence with the sound source(s) positions, and the consistency in geometry of the sounding objects over time.

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

TLDR
Experimental results show that the challenging audio-visual video parsing task can be tackled even with only video-level weak labels, and that the proposed framework can effectively leverage unimodal and cross-modal temporal contexts and alleviate the problems of modality bias and noisy labels.

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

  • Ruohan Gao, K. Grauman
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
TLDR
This work proposes to leverage the speaker’s face appearance as an additional prior to isolate the vocal qualities that speaker is likely to produce, learning from unlabeled video, and yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
...

References

SHOWING 1-10 OF 89 REFERENCES

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
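As a rough illustration of the objective described above, here is a minimal PyTorch sketch (toy encoders and shapes, not the paper's fused architecture) that trains a network to predict whether an audio track is temporally aligned with its video frames, using time-shifted audio as negatives.

# Minimal sketch: self-supervised audio-visual temporal alignment prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentNet(nn.Module):
    def __init__(self, video_dim=3 * 4 * 32 * 32, audio_dim=4000, d=128):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Flatten(), nn.Linear(video_dim, d), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(audio_dim, d), nn.ReLU())
        self.head = nn.Linear(2 * d, 1)             # fused representation -> "aligned?" logit

    def forward(self, frames, audio):
        z = torch.cat([self.video_enc(frames), self.audio_enc(audio)], dim=1)
        return self.head(z).squeeze(1)

net = AlignmentNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

frames = torch.randn(8, 3, 4, 32, 32)               # toy clips: batch x channels x time x H x W
audio = torch.randn(8, 1, 4000)                     # the clips' audio tracks

# Positives: true (frames, audio) pairs; negatives: the same audio shifted in time.
logits_pos = net(frames, audio)
logits_neg = net(frames, torch.roll(audio, shifts=2000, dims=2))
labels = torch.cat([torch.ones(8), torch.zeros(8)])
loss = F.binary_cross_entropy_with_logits(torch.cat([logits_pos, logits_neg]), labels)
opt.zero_grad(); loss.backward(); opt.step()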

Looking to listen at the cocktail party

TLDR
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.

Look, Listen and Learn

TLDR
There is a valuable but so far untapped source of information contained in the video itself, the correspondence between the visual and the audio streams, and a novel “Audio-Visual Correspondence” learning task is introduced that makes use of it.

Audio-visual object localization and separation using low-rank and sparsity

TLDR
A novel optimization problem involving the minimization of nuclear norms and matrix ℓ1-norms is solved, and the proposed method is evaluated on 1) visual localization and audio separation and 2) visually assisted audio denoising.
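For intuition, a minimal NumPy sketch of the generic low-rank plus sparse idea follows: an observation matrix M is split into a low-rank part L (nuclear norm) and a sparse part S (ℓ1 norm) by alternating singular-value thresholding and elementwise shrinkage. This is a standard relaxation, not the paper's exact formulation, and the parameters are illustrative.

# Minimal sketch: low-rank + sparse decomposition via alternating thresholding.
import numpy as np

def soft_threshold(X, t):
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def low_rank_sparse(M, lam=None, mu=1.0, n_iter=100):
    lam = lam if lam is not None else 1.0 / np.sqrt(max(M.shape))
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(n_iter):
        # L-step: singular-value thresholding of the residual M - S
        U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = U @ np.diag(soft_threshold(sig, mu)) @ Vt
        # S-step: elementwise shrinkage of the residual M - L
        S = soft_threshold(M - L, lam * mu)
    return L, S

M = np.random.randn(128, 200)                       # e.g. a magnitude spectrogram
L, S = low_rank_sparse(M)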

SoundNet: Learning Sound Representations from Unlabeled Video

TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
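A minimal PyTorch sketch of the student-teacher transfer follows, assuming a toy 1-D convolutional student and a random stand-in for the visual teacher's class posteriors; SoundNet's actual architecture and training corpus are far larger.

# Minimal sketch: distilling a visual teacher's posteriors into a sound network.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes = 1000
sound_net = nn.Sequential(                          # toy student operating on raw audio
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, n_classes))
opt = torch.optim.Adam(sound_net.parameters(), lr=1e-4)

waveform = torch.randn(4, 1, 22050)                 # audio tracks of unlabeled videos
with torch.no_grad():
    # Stand-in for a pretrained image network's class posteriors on the video frames.
    teacher_probs = torch.softmax(torch.randn(4, n_classes), dim=1)

student_log_probs = F.log_softmax(sound_net(waveform), dim=1)
loss = F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')
opt.zero_grad(); loss.backward(); opt.step()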

Blind Audiovisual Source Separation Based on Sparse Redundant Representations

TLDR
A novel method is proposed which exploits the correlation between the video signal captured with a camera and a synchronously recorded one-microphone audio track to detect and separate audiovisual sources present in a scene.

Objects that Sound

TLDR
New network architectures are designed that can be trained using the AVC task for two functionalities: cross-modal retrieval and localizing the source of a sound in an image.

The Sound of Pixels

TLDR
Qualitative results suggest the PixelPlayer model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources, and experimental results show that the proposed Mix-and-Separate framework outperforms several baselines on source separation.
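The Mix-and-Separate idea can be illustrated with a short PyTorch sketch: audio from two separately recorded videos is summed to form a synthetic mixture, and the network predicts a spectrogram mask for each source conditioned on that video's visual feature, supervised by the known unmixed spectrograms. The module shapes and L1 loss here are simplifying assumptions, not the PixelPlayer model.

# Minimal sketch: Mix-and-Separate style training with visually conditioned masks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskNet(nn.Module):
    def __init__(self, freq=256, frames=64, d_visual=32):
        super().__init__()
        self.audio_enc = nn.Linear(freq * frames, 512)
        self.mask_dec = nn.Linear(512 + d_visual, freq * frames)
        self.freq, self.frames = freq, frames

    def forward(self, mix_spec, visual_feat):
        a = torch.relu(self.audio_enc(mix_spec.flatten(1)))
        m = torch.sigmoid(self.mask_dec(torch.cat([a, visual_feat], dim=1)))
        return m.view(-1, self.freq, self.frames)   # soft mask in [0, 1]

net = MaskNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

spec_a = torch.rand(8, 256, 64)                     # magnitude spectrograms of two
spec_b = torch.rand(8, 256, 64)                     # separately recorded videos
vis_a, vis_b = torch.randn(8, 32), torch.randn(8, 32)   # per-video visual features
mix = spec_a + spec_b                               # synthetic mixture -> free supervision

est_a = net(mix, vis_a) * mix                       # masked estimate for each source
est_b = net(mix, vis_b) * mix
loss = F.l1_loss(est_a, spec_a) + F.l1_loss(est_b, spec_b)
opt.zero_grad(); loss.backward(); opt.step()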

Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization

TLDR
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.

Visual to Sound: Generating Natural Sound for Videos in the Wild

TLDR
The task of generating sound given visual input is posed, and learning-based methods are applied to generate raw waveform samples from input video frames, enabling applications in virtual reality or providing additional accessibility to images and videos for people with visual impairments.
...