Solos: A Dataset for Audio-Visual Music Analysis

@article{Montesinos2020SolosAD,
  title={Solos: A Dataset for Audio-Visual Music Analysis},
  author={Juan F. Montesinos and Olga Slizovskaia and G. Haro},
  journal={2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)},
  year={2020},
  pages={1-6}
}
In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 different instruments. Compared to previously proposed audio-visual datasets, Solos is cleaner since a…
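Since the dataset points to YouTube videos rather than shipping raw media, a typical first step is fetching and trimming the clips. Below is a minimal sketch of that step; the manifest filename and field names are assumptions for illustration, not the official Solos tooling, and it relies on the external yt-dlp and ffmpeg programs.

```python
# Hypothetical sketch: download and trim solo clips from a manifest of
# YouTube IDs. The manifest schema shown here is an assumption; consult
# the official Solos repository for the actual distribution format.
import json
import subprocess

def fetch_clip(video_id: str, start: float, end: float, out_path: str) -> None:
    """Download a YouTube video with yt-dlp, then cut the solo segment with ffmpeg."""
    tmp = f"{video_id}.mp4"
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", tmp,
                    f"https://www.youtube.com/watch?v={video_id}"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", tmp, "-ss", str(start), "-to", str(end),
                    "-c", "copy", out_path], check=True)

with open("solos_manifest.json") as f:  # assumed filename and schema
    manifest = json.load(f)
for entry in manifest:
    fetch_clip(entry["id"], entry["start"], entry["end"],
               f'{entry["instrument"]}_{entry["id"]}.mp4')
```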
Citations

Music source separation conditioned on 3D point clouds
TLDR
It is shown that the presented model can distinguish musical instruments from a single 3D point cloud frame and perform source separation qualitatively similar to a reference case in which manually assigned instrument labels are provided.
Points2Sound: From mono to binaural audio using 3D point cloud scenes
TLDR
Points2Sound, a multi-modal deep learning model that generates a binaural version from mono audio using 3D point cloud scenes, is proposed; both quantitative and perceptual evaluations indicate that it is preferred over a reference case.
Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements
TLDR
A novel system is presented that takes the body movements of a musician playing a musical instrument as input and generates music in an unsupervised setting, using a Vector Quantized Variational Autoencoder (VQ-VAE) with multi-band residual blocks.
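At the heart of such a generator is the VQ-VAE bottleneck: continuous encoder outputs are snapped to their nearest codebook entries, with a straight-through estimator passing gradients around the non-differentiable lookup. A minimal PyTorch sketch of that quantization step (sizes and names are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                                # z: (batch, time, dim)
        flat = z.reshape(-1, z.shape[-1])
        dist = torch.cdist(flat, self.codebook.weight)   # pairwise L2 distances
        codes = dist.argmin(dim=1)                       # nearest code per vector
        z_q = self.codebook(codes).view_as(z)
        # Codebook + commitment losses (the VQ-VAE objective, minus reconstruction).
        loss = ((z_q - z.detach()) ** 2).mean() \
             + self.beta * ((z_q.detach() - z) ** 2).mean()
        z_q = z + (z_q - z).detach()                     # straight-through estimator
        return z_q, loss
```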

References

Showing 1-10 of 46 references
OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields
TLDR
OpenPose is released, the first open-source real-time system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints, together with the first combined body-and-foot keypoint detector, based on an internal annotated foot dataset.
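Pose detectors like OpenPose are how skeletons end up attached to performance datasets such as Solos; its JSON writer emits one file per frame with a `people` array whose `pose_keypoints_2d` field is a flat `[x, y, confidence, ...]` list (75 values for the BODY_25 model). A small parser under that assumption:

```python
import json
import numpy as np

def load_body_keypoints(json_path: str) -> list[np.ndarray]:
    """Parse one frame of OpenPose JSON output into (25, 3) arrays of
    (x, y, confidence) per detected person (BODY_25 model assumed)."""
    with open(json_path) as f:
        frame = json.load(f)
    people = []
    for person in frame.get("people", []):
        kp = np.asarray(person["pose_keypoints_2d"], dtype=np.float32)
        people.append(kp.reshape(-1, 3))  # rows: keypoints; cols: x, y, confidence
    return people
```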
Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications
We introduce a dataset for facilitating audio-visual analysis of music performances. The dataset comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks.
Interleaved Multitask Learning for Audio Source Separation with Independent Databases
TLDR
A model is presented that decomposes the learnable parameters into a shared parametric model (encoder) and independent components (decoders) specific to each source, and thus does not require each sample to possess a ground truth for all of its composing sources.
Vision-Infused Deep Audio Inpainting
TLDR
This work considers a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that are coherent with their accompanying videos, and shows the effectiveness of the proposed Vision-Infused Audio Inpainter (VIAI).
The Sound of Pixels
TLDR
Qualitative results suggest the PixelPlayer model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of individual sound sources, while experimental results show the proposed Mix-and-Separate framework outperforms several baselines on source separation.
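Mix-and-Separate manufactures its own supervision: add the audio of two videos, then train a network to recover each original by predicting a spectrogram mask conditioned on that video's visual features. A schematic PyTorch training step under that idea (`audio_net` and `video_net` are placeholder modules, and a ratio mask is used here where the original paper predicts binary masks):

```python
import torch
import torch.nn.functional as F

def mix_and_separate_step(audio_net, video_net, spec_a, spec_b, frames_a, frames_b):
    """One self-supervised step: mix two magnitude spectrograms, predict a
    mask for each source from its own video, and regress the originals."""
    mix = spec_a + spec_b                            # synthetic two-source mixture
    target_a = (spec_a / mix.clamp(min=1e-8)).clamp(0, 1)  # ideal ratio masks
    target_b = (spec_b / mix.clamp(min=1e-8)).clamp(0, 1)
    feat_a, feat_b = video_net(frames_a), video_net(frames_b)
    mask_a = torch.sigmoid(audio_net(mix, feat_a))   # visually conditioned masks
    mask_b = torch.sigmoid(audio_net(mix, feat_b))
    return F.binary_cross_entropy(mask_a, target_a) + \
           F.binary_cross_entropy(mask_b, target_b)
```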
Dilated Residual Networks
TLDR
It is shown that dilated residual networks (DRNs) outperform their non-dilated counterparts in image classification without increasing the model's depth or complexity, and that the accuracy advantage of DRNs is further magnified in downstream applications such as object localization and semantic segmentation.
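Dilation is what lets these residual blocks grow their receptive field without pooling or added depth, which is also why they recur in the audio-visual separation networks above. A standard 2-D dilated residual block in PyTorch:

```python
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Two 3x3 convolutions with dilation keep spatial resolution while the
    receptive field grows; identity shortcut as in a standard ResNet block."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # 'same' padding for a 3x3 kernel at this dilation
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))
```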
Music Gesture for Visual Sound Separation
TLDR
This work proposes "Music Gesture," a keypoint-based structured representation that explicitly models the body and finger movements of musicians as they perform. It adopts a context-aware graph network to integrate visual semantic context with body dynamics, and applies an audio-visual fusion model to associate body movements with the corresponding audio signals.
Co-Separating Sounds of Visual Objects
  • Ruohan Gao, Kristen Grauman
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
TLDR
This work introduces a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos, obtaining state-of-the-art results on visually guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.
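The core constraint in co-separation is that the per-object sounds separated from a video must add back up to that video's soundtrack (alongside an object-consistency classifier not shown here). A sketch of that remix-consistency term, with `separator` and `object_feats` as placeholders:

```python
import torch

def coseparation_consistency(separator, mix_spec, object_feats):
    """Separate one spectrogram estimate per detected object, then require
    the per-object estimates to sum back to the original mixture."""
    estimates = [separator(mix_spec, f) for f in object_feats]  # one per object
    remix = torch.stack(estimates).sum(dim=0)
    return torch.nn.functional.l1_loss(remix, mix_spec)
```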
Learning Individual Styles of Conversational Gesture
TLDR
A method for cross-modal translation from "in-the-wild" monologue speech of a single speaker to their conversational gesture motion is presented; it significantly outperforms baseline methods in a quantitative comparison.
Online Audio-Visual Source Association for Chamber Music Performances
TLDR
A computational system is presented that models audio-visual correspondences to achieve source association for Western chamber music ensembles including strings, woodwind, and brass instruments, and that enables novel applications such as interactive audio-visual music editing and an auto-whirling camera in concerts.