Learning to Separate Object Sounds by Watching Unlabeled Video

@article{Gao2018LearningTS,
  title={Learning to Separate Object Sounds by Watching Unlabeled Video},
  author={Ruohan Gao and Rog{\'e}rio Schmidt Feris and Kristen Grauman},
  journal={ArXiv},
  year={2018},
  volume={abs/1804.01665}
}
Perceiving a scene most fully requires all the senses. […]
Key Method
We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to study audio source separation in large-scale general "in the wild" videos. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising.
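For concreteness, the guided-separation step can be pictured as a non-negative matrix factorization of the mixture spectrogram whose recovered frequency bases are assigned to visual objects and then recombined per object. The sketch below follows that picture; the cosine-similarity assignment, the `object_bases` dictionary, and all shapes are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: NMF bases of a mixture spectrogram are assigned to visual
# objects via similarity to per-object bases, then each object's spectrogram
# is rebuilt from its assigned bases with a soft (Wiener-style) mask.
# Assumptions: `mixture_spec` is a non-negative magnitude spectrogram
# (freq x time); `object_bases` maps an object name to a (freq x n_bases)
# matrix of bases learned for that object.
import numpy as np
from sklearn.decomposition import NMF

def separate(mixture_spec, object_bases, n_components=25, eps=1e-8):
    nmf = NMF(n_components=n_components, init="random", max_iter=400, random_state=0)
    W = nmf.fit_transform(mixture_spec)   # (freq, n_components) recovered bases
    H = nmf.components_                   # (n_components, time) activations

    # Cosine similarity between every recovered basis and every object basis.
    def cos(a, b):
        return a.T @ b / (np.linalg.norm(a, axis=0, keepdims=True).T
                          @ np.linalg.norm(b, axis=0, keepdims=True) + eps)

    names = list(object_bases)
    scores = np.stack([cos(W, object_bases[n]).max(axis=1) for n in names])
    owner = scores.argmax(axis=0)         # which object owns each recovered basis

    # Reconstruct one spectrogram per object from its assigned bases.
    full = W @ H + eps
    separated = {}
    for i, name in enumerate(names):
        idx = np.where(owner == i)[0]
        part = W[:, idx] @ H[idx, :]
        separated[name] = (part / full) * mixture_spec
    return separated
```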
Co-Separating Sounds of Visual Objects
  • Ruohan Gao, K. Grauman
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
TLDR
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
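A minimal sketch of the co-separation idea follows: each detected object gets its own predicted mask, the per-object spectrograms must sum back to the video's own audio, and each separated source must still be recognizable as its object category. The `separator` and `classifier` modules are hypothetical placeholders for the paper's architecture.

```python
# Co-separation sketch: one mask per detected object, with a reconstruction
# term (sources sum to the mixture) and an object-consistency term.
import torch
import torch.nn.functional as F

def co_separation_loss(separator, classifier, mix_spec, object_feats, object_labels):
    # mix_spec: (1, F, T) mixture spectrogram of one video
    # object_feats: (N, D) visual features of the N detected objects
    # object_labels: (N,) category index of each detected object
    masks = torch.stack([separator(mix_spec, f) for f in object_feats])  # (N, 1, F, T)
    sources = masks * mix_spec                                           # per-object spectrograms

    # 1) Separated sources should add up to the observed mixture.
    loss_recon = F.l1_loss(sources.sum(dim=0), mix_spec)

    # 2) Each separated source should still sound like its object category.
    logits = classifier(sources.squeeze(1))                              # (N, n_classes)
    loss_obj = F.cross_entropy(logits, object_labels)

    return loss_recon + loss_obj
```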
Weakly-Supervised Audio-Visual Sound Source Detection and Separation
TLDR
An audio-visual co-segmentation approach in which the network learns both what individual objects look like and what they sound like, from videos labeled with only object labels; it outperforms state-of-the-art methods on visually guided sound source separation and sound denoising.
Weakly Supervised Representation Learning for Audio-Visual Scene Analysis
TLDR
This work develops methods that identify events and localize corresponding AV cues in unconstrained videos using weak labels, and demonstrates the framework's ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization.
Self-Supervised Learning of Audio-Visual Objects from Video
TLDR
This work introduces a model that uses attention to localize and group sound sources and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.
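The attention-based localization can be sketched as a similarity map between an audio embedding and each spatial position of a visual feature map; the flow-based temporal aggregation is omitted here and the tensor shapes are assumptions.

```python
# Sketch of attention-based sound-source localization: the audio embedding is
# compared against every spatial position of the visual feature map, and the
# similarity map doubles as a localization heat map.
import torch

def localize(audio_emb, visual_map):
    # audio_emb: (B, D), visual_map: (B, D, H, W)
    B, D, H, W = visual_map.shape
    feats = visual_map.flatten(2)                       # (B, D, H*W)
    sim = torch.einsum("bd,bdn->bn", audio_emb, feats)  # per-position similarity
    attn = torch.softmax(sim / D ** 0.5, dim=-1)        # spatial attention weights
    heatmap = attn.view(B, H, W)                        # where the sound comes from
    attended = torch.einsum("bn,bdn->bd", attn, feats)  # grouped visual feature
    return heatmap, attended
```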
Weakly-Supervised Audio-Visual Video Parsing Toward Unified Multisensory Perception
TLDR
This work formulates weakly-supervised audio-visual video parsing as a Multimodal Multiple Instance Learning (MMIL) problem and proposes a new framework to solve it, developing an attentive MMIL pooling method for adaptively aggregating useful audio and visual content from different temporal extents and modalities.
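A rough sketch of attentive MIL pooling follows, assuming per-segment, per-modality features and a two-softmax attention (over time and over modality); the exact parameterization in the paper may differ.

```python
# Attentive MIL pooling sketch: per-segment, per-modality event scores are
# aggregated into a video-level multi-label prediction with learned attention
# over both time and modality.
import torch
import torch.nn as nn

class AttentiveMILPool(nn.Module):
    def __init__(self, dim, n_classes):
        super().__init__()
        self.score = nn.Linear(dim, n_classes)   # per-snippet event scores
        self.attn_t = nn.Linear(dim, n_classes)  # attention over time
        self.attn_m = nn.Linear(dim, n_classes)  # attention over modality

    def forward(self, x):
        # x: (B, T, M, D) features for T segments and M modalities (audio, visual)
        logits = self.score(x)                               # (B, T, M, C)
        w_t = torch.softmax(self.attn_t(x), dim=1)           # weights across time
        w_m = torch.softmax(self.attn_m(x), dim=2)           # weights across modality
        video_logits = (logits * w_t * w_m).sum(dim=(1, 2))  # (B, C) video level
        return torch.sigmoid(video_logits)                   # multi-label probabilities
```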
Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation
TLDR
An audio spatialization framework is proposed that converts a monaural video into a binaural one by exploiting the relationship between the audio and visual components; it can be viewed as a self-supervised learning technique and alleviates the dependency on large amounts of video with ground-truth binaural audio during training.
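One common way to frame mono-to-binaural conversion, used here only as an illustration, is to predict the left-minus-right difference signal from the mono input and visual features; `diff_net` is a hypothetical module and the paper's consistency losses are not shown.

```python
# Mono-to-binaural sketch: predict the difference signal, then recover the
# two channels from the channel sum.
import torch

def mono_to_binaural(diff_net, mono_spec, visual_feat):
    # mono_spec: complex spectrogram (or waveform) of the channel sum L + R
    diff_spec = diff_net(mono_spec, visual_feat)   # predicted difference L - R
    left = (mono_spec + diff_spec) / 2
    right = (mono_spec - diff_spec) / 2
    return left, right
```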
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
TLDR
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.
Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video
TLDR
This work develops a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream's coherence with the sound sources' positions, and the consistency in geometry of the sounding objects over time.
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
TLDR
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels, and the proposed framework can effectively leverage unimodal and cross-modal temporal contexts and alleviate modality bias and noisy labels problems.
VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
  • Ruohan Gao, K. Grauman
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
TLDR
This work proposes to leverage the speaker's face appearance as an additional prior to isolate the vocal qualities that speaker is likely to produce, learned from unlabeled video, and yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
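The cross-modal consistency idea can be sketched as an embedding constraint: the separated voice should lie closer to its own speaker's face embedding than to the other speaker's. The plain triplet margin form below is an illustrative stand-in for the paper's exact objective.

```python
# Cross-modal face-voice consistency sketch: a triplet-style margin loss on
# cosine distances between voice and face embeddings.
import torch
import torch.nn.functional as F

def cross_modal_consistency(voice_emb_a, face_emb_a, face_emb_b, margin=0.5):
    # all embeddings: (B, D)
    pos = 1 - F.cosine_similarity(voice_emb_a, face_emb_a)   # distance to own face
    neg = 1 - F.cosine_similarity(voice_emb_a, face_emb_b)   # distance to other face
    return F.relu(pos - neg + margin).mean()
```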
...

References

Showing 1-10 of 89 references
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
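A minimal sketch of the alignment pretext task, assuming a fused audio-visual classifier `av_net`: positives pair frames with their own audio, negatives pair them with temporally shifted audio, and a binary loss trains the network to tell them apart.

```python
# Temporal-alignment pretext sketch: shifted audio serves as the misaligned
# negative for each clip.
import torch
import torch.nn.functional as F

def alignment_loss(av_net, frames, audio, shift):
    # frames: (B, C, T, H, W) video clip, audio: (B, L) waveform
    audio_neg = torch.roll(audio, shifts=shift, dims=1)   # misaligned audio
    logits_pos = av_net(frames, audio)                    # (B,) alignment scores
    logits_neg = av_net(frames, audio_neg)
    labels_pos = torch.ones_like(logits_pos)
    labels_neg = torch.zeros_like(logits_neg)
    return F.binary_cross_entropy_with_logits(
        torch.cat([logits_pos, logits_neg]),
        torch.cat([labels_pos, labels_neg]))
```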
Looking to Listen at the Cocktail Party
TLDR
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
Audio-visual object localization and separation using low-rank and sparsity
TLDR
A novel optimization problem involving the minimization of nuclear norms and matrix ℓ1-norms is solved, and the proposed method is evaluated on 1) visual localization and audio separation and 2) visual-assisted audio denoising.
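The nuclear-norm plus ℓ1 objective behind such formulations can be illustrated with a generic low-rank-plus-sparse decomposition solved by alternating proximal steps; this is a standard robust-PCA style loop, not the paper's exact problem.

```python
# Generic sketch: minimize ||L||_* + lam * ||S||_1 subject to M = L + S,
# using singular-value thresholding for the nuclear norm and soft
# thresholding for the l1 norm inside an augmented-Lagrangian loop.
import numpy as np

def soft(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def svt(X, tau):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft(s, tau)) @ Vt

def low_rank_sparse(M, lam=None, mu=None, n_iter=200):
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = lam or 1 / np.sqrt(max(m, n))          # common default weight
    mu = mu or 0.25 * m * n / np.abs(M).sum()    # common default step size
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1 / mu)          # nuclear-norm proximal step
        S = soft(M - L + Y / mu, lam / mu)       # l1-norm proximal step
        Y = Y + mu * (M - L - S)                 # dual update
    return L, S
```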
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
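The student-teacher transfer can be sketched as matching output distributions: a pretrained visual network scores the frames, and the sound network is trained to reproduce that distribution from the waveform via a KL-divergence loss. `vision_teacher` and `sound_student` are placeholders for the pretrained and trained models.

```python
# Student-teacher transfer sketch: the sound network matches the visual
# network's output distribution on unlabeled video.
import torch
import torch.nn.functional as F

def transfer_loss(vision_teacher, sound_student, frames, waveform):
    with torch.no_grad():
        teacher_probs = F.softmax(vision_teacher(frames), dim=-1)     # visual "labels"
    student_log_probs = F.log_softmax(sound_student(waveform), dim=-1)
    # KL divergence pulls the sound network toward the visual predictions.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```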
Objects that Sound
TLDR
New network architectures are designed that can be trained using the audio-visual correspondence (AVC) task for two functionalities: cross-modal retrieval, and localizing the source of a sound in an image.
The Sound of Pixels
TLDR
Qualitative results suggest the PixelPlayer model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources, and experimental results show that the proposed Mix-and-Separate framework outperforms several baselines on source separation.
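The Mix-and-Separate recipe can be sketched as follows: sum the audio of two videos, predict one mask per video conditioned on its frames, and supervise against the masks implied by the known components. The ratio-mask L1 form below is a simplified variant, not necessarily the paper's exact loss; `mask_net` is a hypothetical visually conditioned separator.

```python
# Mix-and-Separate sketch: synthetic mixtures of two videos provide free
# supervision for visually conditioned mask prediction.
import torch
import torch.nn.functional as F

def mix_and_separate_loss(mask_net, spec_a, spec_b, frames_a, frames_b, eps=1e-8):
    # spec_a, spec_b: (B, F, T) magnitude spectrograms of two different videos
    mix = spec_a + spec_b
    target_a = spec_a / (mix + eps)          # ideal ratio mask for video A
    target_b = spec_b / (mix + eps)
    pred_a = mask_net(mix, frames_a)         # mask predicted from A's frames
    pred_b = mask_net(mix, frames_b)
    return F.l1_loss(pred_a, target_a) + F.l1_loss(pred_b, target_b)
```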
Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization
TLDR
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.
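An InfoNCE-style stand-in for the contrastive synchronization objective is sketched below: in-sync audio-visual pairs are pulled together and the other pairings in the batch act as negatives; the paper's curriculum and hard-negative sampling are not shown.

```python
# Contrastive synchronization sketch: match each video clip to its own audio
# among all audio clips in the batch.
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb, audio_emb, temperature=0.1):
    # video_emb, audio_emb: (B, D); row i of each comes from the same clip
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature               # (B, B) similarities
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)
```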
Visual to Sound: Generating Natural Sound for Videos in the Wild
TLDR
The task of generating sound given visual input is posed and learning-based methods are applied to generate raw waveform samples given input video frames to enable applications in virtual reality or provide additional accessibility to images or videos for people with visual impairments.
Discovering joint audio–visual codewords for video event detection
TLDR
This paper proposes a new representation, called bi-modal words, to explore representative joint audio–visual patterns in videos, and finds that average pooling is particularly suitable for the bi-modal representation and that using multiple kernel learning to combine multi-modal representations at various granularities is helpful.
Generative Modeling of Audible Shapes for Object Perception
TLDR
It is demonstrated that auditory and visual information play complementary roles in object perception, and further, that the representation learned on synthetic audio-visual data can transfer to real-world scenarios.
...