Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

@inproceedings{Owens2018AudioVisualSA,
  title={Audio-Visual Scene Analysis with Self-Supervised Multisensory Features},
  author={Andrew Owens and Alexei A. Efros},
  booktitle={European Conference on Computer Vision},
  year={2018}
}
The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. […] Key Method: We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's…
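
The representation above comes from a synchrony pretext task: a fused audio-visual network is trained to predict whether the audio track and the frames are temporally aligned. Below is a minimal PyTorch sketch of that kind of objective; the module names, layer sizes, and the 0.25 s audio shift used to create negatives are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of an audio-visual synchrony pretext task: classify whether a clip's
# audio is aligned with its frames (label 1) or temporally shifted (label 0).
# All layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class SynchronyNet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # 3D conv stack over video frames (B, 3, T, H, W) -> per-clip feature
        self.video_net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # 1D conv stack over the raw waveform (B, 1, N) -> per-clip feature
        self.audio_net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Fusion head predicts "aligned" vs "misaligned"
        self.classifier = nn.Linear(2 * feat_dim, 2)

    def forward(self, frames, waveform):
        v = self.video_net(frames)
        a = self.audio_net(waveform)
        return self.classifier(torch.cat([v, a], dim=1))

# One training step: real pairs are labelled 1, pairs with shifted audio 0.
model = SynchronyNet()
frames = torch.randn(4, 3, 8, 64, 64)             # batch of short clips
audio = torch.randn(4, 1, 16000)                   # matching 1 s of audio at 16 kHz
shifted = torch.roll(audio, shifts=4000, dims=2)   # misaligned negatives (0.25 s shift)
inputs = torch.cat([frames, frames]), torch.cat([audio, shifted])
labels = torch.tensor([1] * 4 + [0] * 4)
loss = nn.functional.cross_entropy(model(*inputs), labels)
loss.backward()
```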

Crossmodal learning for audio-visual speech event localization

This work presents visual representations that carry implicit information about when and where someone is talking, and proposes a crossmodal neural network for audio speech event detection using the visual frames.

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

A novel self-supervised framework with a co-attention mechanism is proposed to learn generic cross-modal representations from unlabelled videos in the wild and to benefit downstream tasks.
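
As a rough illustration of what a co-attention module between the two modalities can look like, the sketch below lets each modality attend over the temporal features of the other; the class name, dimensions, and the use of standard multi-head attention are assumptions rather than the paper's design.

```python
# Minimal co-attention sketch: visual features attend over audio features and
# vice versa, producing audio-guided visual features and visually-guided audio
# features. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attend_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attend_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio: (B, Ta, dim), visual: (B, Tv, dim)
        visual_att, _ = self.attend_to_audio(query=visual, key=audio, value=audio)
        audio_att, _ = self.attend_to_visual(query=audio, key=visual, value=visual)
        return audio_att, visual_att

co = CoAttention()
a, v = co(torch.randn(2, 16, 128), torch.randn(2, 8, 128))
print(a.shape, v.shape)  # torch.Size([2, 16, 128]) torch.Size([2, 8, 128])
```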

Self-Supervised Learning of Audio-Visual Objects from Video

This work introduces a model that uses attention to localize and group sound sources and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.

Weakly-Supervised Audio-Visual Video Parsing Toward Unified Multisensory Perception

This work formulates weakly-supervised audio-visual video parsing as a Multimodal Multiple Instance Learning (MMIL) problem, proposes a new framework to solve it, and develops an attentive MMIL pooling method for adaptively aggregating useful audio and visual content from different temporal extents and modalities.
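
To make the attentive MMIL pooling idea concrete, here is a hedged sketch in which snippet-level audio and visual features are aggregated into a video-level multi-label prediction using learned attention weights over time and over modality, so only video-level labels are needed for training. Shapes, layer names, and the number of classes are illustrative assumptions.

```python
# Attentive MIL pooling over temporal snippets and two modalities: snippet
# features are weighted by learned temporal and modality attention, then
# aggregated into a single video-level multi-label prediction.
import torch
import torch.nn as nn

class AttentiveMMILPooling(nn.Module):
    def __init__(self, dim=256, num_classes=25):
        super().__init__()
        self.temporal_attn = nn.Linear(dim, 1)   # weight per snippet
        self.modality_attn = nn.Linear(dim, 1)   # weight per modality
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats, visual_feats: (B, T, dim) snippet-level features
        x = torch.stack([audio_feats, visual_feats], dim=1)   # (B, 2, T, dim)
        w_t = torch.softmax(self.temporal_attn(x), dim=2)     # attention over time
        w_m = torch.softmax(self.modality_attn(x), dim=1)     # attention over modality
        pooled = (x * w_t * w_m).sum(dim=(1, 2))              # (B, dim) video-level feature
        return torch.sigmoid(self.classifier(pooled))         # multi-label probabilities

pool = AttentiveMMILPooling()
probs = pool(torch.randn(2, 10, 256), torch.randn(2, 10, 256))
print(probs.shape)  # torch.Size([2, 25])
```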

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

  • Ruohan Gao, K. Grauman
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
This work leverages the speaker's face appearance as an additional prior for isolating the vocal qualities they are likely to produce, learned from unlabeled video, and yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.

Looking to listen at the cocktail party

A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
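
A common way to realize this kind of audio-visual separation is mask prediction: a network conditioned on per-frame face features of the target speaker predicts a time-frequency mask that is applied to the mixture spectrogram. The sketch below illustrates that pattern; layer sizes and names are assumptions, not the paper's exact architecture.

```python
# Mask-based audio-visual speech separation sketch: the network fuses the
# mixture spectrogram with per-frame face embeddings of the target speaker
# and predicts a soft mask that isolates that speaker's voice.
import torch
import torch.nn as nn

class AVMaskNet(nn.Module):
    def __init__(self, freq_bins=257, face_dim=512, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(freq_bins, hidden)
        self.face_proj = nn.Linear(face_dim, hidden)
        self.rnn = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.mask_head = nn.Linear(hidden, freq_bins)

    def forward(self, mix_spec, face_feats):
        # mix_spec: (B, T, F) magnitude spectrogram of the mixture
        # face_feats: (B, T, face_dim) face embedding of the target speaker per frame
        a = self.audio_proj(mix_spec)
        v = self.face_proj(face_feats)
        h, _ = self.rnn(torch.cat([a, v], dim=-1))
        mask = torch.sigmoid(self.mask_head(h))   # (B, T, F) values in [0, 1]
        return mask * mix_spec                     # estimated target spectrogram

net = AVMaskNet()
est = net(torch.rand(1, 100, 257), torch.randn(1, 100, 512))
print(est.shape)  # torch.Size([1, 100, 257])
```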

Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment

This work presents a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than audio-visual correspondence (AVC), which learns from the spatial location of acoustic and visual content.
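
One simplified way to pose such a spatial-alignment pretext task is contrastive matching between viewing directions: video crops from several directions of a 360-degree recording and the spatial audio rendered for those same directions are embedded, and the model must pair each crop with the audio from its own direction. The function below is a hedged sketch under those assumptions; the encoders producing the embeddings are left out.

```python
# Contrastive spatial-alignment loss sketch: embeddings of K viewing directions
# from the same recording should match the audio embedding of the same direction.
import torch
import torch.nn.functional as F

def spatial_alignment_loss(video_emb, audio_emb, temperature=0.07):
    """video_emb, audio_emb: (K, D) embeddings of K viewing directions."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature        # (K, K) cross-direction similarities
    targets = torch.arange(len(v))          # direction k should match audio k
    return F.cross_entropy(logits, targets)

loss = spatial_alignment_loss(torch.randn(4, 128), torch.randn(4, 128))
```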

Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations

  • Lingyu Zhu, Esa Rahtu
  • 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022
This paper performs audio-visual sound source separation, i.e. separating component audio signals from a mixture based on videos of the sound sources, and proposes a two-stage architecture, called the Appearance and Motion network (AM-net), whose stages specialise in appearance and motion cues, respectively.

Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds

This paper proposes a novel self-supervised approach to learn a cross-modal feature representation that captures both the category and location of each sound source using stereo sound as input, and applies this method to cross-modal image/audio retrieval.

Learning to Separate Object Sounds by Watching Unlabeled Video

This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.
...

References


Audio-visual graphical models for speech processing

This work proposes to fuse audio and video in a probabilistic generative model that implements cross-modal self-supervised learning, enabling adaptation to audio-visual data, and shows results for speech detection and enhancement.

Looking to listen at the cocktail party

A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.

Audio-visual scene analysis: evidence for a "very-early" integration process in audio-visual speech perception

It is shown here that the "speech detection" benefit may produce a "speech identification" benefit distinct from lipreading per se, and that the extraction of auditory cues from visual movements can be understood as a kind of "very early" fusion process.

Learning to Separate Object Sounds by Watching Unlabeled Video

This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.

Seeing Through Noise: Visually Driven Speaker Separation And Enhancement

The face motions captured in the video are used to estimate the speaker's voice by passing the silent video frames through a video-to-speech neural network, and the resulting speech predictions are applied as a filter on the noisy input audio.
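
The filtering step can be sketched as follows: a speech spectrogram predicted from the silent video (by whatever video-to-speech model) is converted into a soft time-frequency mask and applied to the spectrogram of the noisy audio. The normalization and floor value below are illustrative assumptions.

```python
# Visually driven filtering sketch: turn the video-predicted speech spectrogram
# into a soft mask and apply it to the noisy audio's spectrogram.
import torch

def visually_driven_filter(noisy_spec, predicted_spec, floor=0.1):
    """noisy_spec, predicted_spec: (T, F) magnitude spectrograms on the same grid."""
    # Normalize the prediction per frame and clamp to [floor, 1] as a soft mask.
    mask = predicted_spec / (predicted_spec.amax(dim=1, keepdim=True) + 1e-8)
    mask = mask.clamp(floor, 1.0)
    return noisy_spec * mask

enhanced = visually_driven_filter(torch.rand(100, 257), torch.rand(100, 257))
print(enhanced.shape)  # torch.Size([100, 257])
```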

Detecting audio-visual synchrony using deep neural networks

This paper addresses the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not, and investigates the use of deep neural networks (DNNs) for this purpose.

Harmony in Motion

An approach is described that focuses on temporal features based on significant changes in each modality, identifies temporal coincidences between these features, and thereby yields cross-modal association and visual localization.

Audio Vision: Using Audio-Visual Synchrony to Locate Sounds

A system is developed that searches for regions of the visual scene that correlate highly with the acoustic signal and tags them as likely to contain an acoustic source; results are presented on a speaker localization task.

Audio-visual Segmentation and "The Cocktail Party Effect"

It is shown how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams, and how the method can help reduce the effect of noise in automatic speech recognition.

Objects that Sound

New network architectures are designed that can be trained using the AVC task for two functionalities: cross-modal retrieval, and localizing the source of a sound in an image.
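
For the localization functionality, a typical AVC-style recipe is to embed the audio clip and each spatial position of the image feature map into a shared space and read off a heatmap of cosine similarities. The sketch below shows that computation; feature dimensions are assumptions.

```python
# Sound-source localization with an AVC-style embedding: cosine similarity
# between the audio embedding and each spatial position of the visual feature
# map gives a localization heatmap.
import torch
import torch.nn.functional as F

def localization_heatmap(visual_map, audio_vec):
    """visual_map: (B, C, H, W) conv features; audio_vec: (B, C) audio embedding."""
    v = F.normalize(visual_map, dim=1)                 # unit-norm per spatial position
    a = F.normalize(audio_vec, dim=1)[:, :, None, None]
    return (v * a).sum(dim=1)                          # (B, H, W) similarity map

heatmap = localization_heatmap(torch.randn(1, 128, 14, 14), torch.randn(1, 128))
print(heatmap.shape)  # torch.Size([1, 14, 14])
```
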
...