Corpus ID: 235458372

Improving On-Screen Sound Separation for Open Domain Videos with Audio-Visual Self-attention

Efthymios Tzinis, Scott Wisdom, Tal Remez, J. Hershey
We introduce a state-of-the-art audio-visual on-screen sound separation system that learns to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous work on audio-visual on-screen sound separation, including the simplicity and coarse resolution of spatio-temporal attention, and poor convergence of the audio separation model. Our proposed model addresses these issues using cross-modal and self-attention…
1 Citation


Multi-Modal Residual Perceptron Network for Audio–Video Emotion Recognition
A Multi-modal Residual Perceptron Network is defined that performs end-to-end learning from multi-modal network branches, yields a multi-modal feature representation that generalizes better, and shows potential for multi-modal applications dealing with signal sources not only of optical and acoustical types.


Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
This work presents AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos, using a dataset of video clips extracted from open-domain YFCC100m video data.
Co-Separating Sounds of Visual Objects
  • Ruohan Gao, K. Grauman
  • Computer Science, Engineering
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
Weakly-supervised Audio-visual Sound Source Detection and Separation
An audio-visual co-segmentation approach in which the network learns both what individual objects look like and what they sound like, from videos labeled with only object labels; it outperforms state-of-the-art methods on visually guided sound source separation and sound denoising.
Learning to Separate Object Sounds by Watching Unlabeled Video
This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations
This paper proposes a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise to appearance and motion cues, respectively, and introduces an Audio-Motion Embedding framework to explicitly represent the motions that are related to sound.
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
A novel self-supervised framework with a co-attention mechanism is proposed to learn generic cross-modal representations from unlabelled videos in the wild and to further benefit downstream tasks.
Self-Supervised Learning of Audio-Visual Objects from Video
This work introduces a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.
Looking to listen at the cocktail party
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
It is demonstrated that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs.