• Corpus ID: 231786684

Music source separation conditioned on 3D point clouds

Francesc Lluís, Vasileios Chatziioannou, Alex Hofmann
Recently, significant progress has been made in audio source separation through the application of deep learning techniques. Current methods that combine audio and visual information use 2D representations such as images to guide the separation process. However, in order to (re)create acoustically correct scenes for 3D virtual/augmented reality applications from recordings of real music ensembles, detailed information about each sound source in the 3D environment is required. This demand…


Points2Sound: From mono to binaural audio using 3D point cloud scenes
Experimental results indicate that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis; the work also investigates different loss functions and 3D point cloud attributes.
Detector-Free Weakly Supervised Grounding by Separation
The key idea behind the proposed Grounding by Separation (GbS) method is synthesizing ‘text to image-regions’ associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network.


Anechoic audio and 3D-video content database of small ensemble performances for virtual concerts
This paper presents details of the creation of a public database of anechoic audio and 3D-video recordings of several small music ensemble performances, providing the community with flexible audiovisual content for virtual acoustic simulations.
Music Gesture for Visual Sound Separation
This work proposes ``Music Gesture," a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music, which adopts a context-aware graph network to integrate visual semantic context with body dynamics and applies an audio-visual fusion model to associate body movements with the corresponding audio signals.
Learning to Separate Object Sounds by Watching Unlabeled Video
This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.
Co-Separating Sounds of Visual Objects
  • Ruohan Gao, K. Grauman · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, and obtains state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
Separating Sounds from a Single Image
This paper investigates the performance of appearance information, extracted from a single image, in the task of recovering the original component signals from an audio mixture, and introduces an efficient appearance attention module to improve sound separation performance.
4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks
This work creates an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks and proposes the hybrid kernel, a special case of the generalized sparse convolution, and trilateral-stationary conditional random fields that enforce spatio-temporal consistency in the 7D space-time-chroma space.
Open-Unmix - A Reference Implementation for Music Source Separation
Open-Unmix provides implementations for the most popular deep learning frameworks, giving researchers a flexible way to reproduce results and provides a pre-trained model for end users and even artists to try and use source separation.
Solos: A Dataset for Audio-Visual Music Analysis
In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and …
Music Source Separation in the Waveform Domain
Demucs is proposed, a new waveform-to-waveform model with an architecture closer to models for audio generation and more capacity in the decoder; human evaluations show that Demucs has significantly higher quality than Conv-Tasnet, but slightly more contamination from other sources, which explains the difference in SDR.
3D ShapeNets: A deep representation for volumetric shapes
This work proposes to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network, and shows that this 3D deep representation enables significant performance improvement over the state of the art in a variety of tasks.