Suggesting Sounds for Images from Video Collections

Authors: Matthias Solèr, Jean Charles Bazin, Oliver Wang, Andreas Krause, Alexander Sorkine-Hornung
Published in: ECCV Workshops
Given a still image, humans can easily think of a sound associated with this image. We present an unsupervised, clustering-based solution that is able to automatically separate correlated sounds from uncorrelated ones. The core algorithm is based on a joint audio-visual feature space, in which we perform iterated mutual kNN clustering in order to effectively filter out uncorrelated sounds. To this end we also introduce a new dataset of correlated audio-visual data, on which we evaluate our…
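The filtering idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each sample is already a single joint audio-visual feature vector, that neighbours are found by Euclidean distance, and that a sample is treated as uncorrelated when it has no mutual k-nearest neighbour. The function name `mutual_knn_filter` and the parameters `k`, `min_mutual`, and `max_iters` are hypothetical.

```python
import numpy as np


def mutual_knn_filter(features, k=3, min_mutual=1, max_iters=10):
    """Return indices of samples that survive iterated mutual-kNN filtering.

    Hypothetical sketch: a sample is dropped when fewer than `min_mutual`
    of its k nearest neighbours also list it among their own k nearest.
    """
    features = np.asarray(features, dtype=float)
    keep = np.arange(len(features))
    for _ in range(max_iters):
        if len(keep) <= k:
            break  # too few samples left to form k-neighbourhoods
        pts = features[keep]
        # Pairwise Euclidean distances; a sample is never its own neighbour.
        dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)
        nn = np.argsort(dists, axis=1)[:, :k]
        # is_nn[i, j] is True when j is among the k nearest neighbours of i.
        is_nn = np.zeros_like(dists, dtype=bool)
        is_nn[np.arange(len(pts))[:, None], nn] = True
        mutual = is_nn & is_nn.T
        survivors = mutual.sum(axis=1) >= min_mutual
        if survivors.all():
            break  # nothing was removed, so the filtering has converged
        keep = keep[survivors]
    return keep


# Toy joint features (assumed): two tight groups plus two isolated samples.
joint = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, -20.0], [-30.0, 40.0],
])
kept = mutual_knn_filter(joint, k=3)
print(kept)
```

In this toy input the two isolated samples have no mutual neighbours and are filtered out in the first pass, while the two tight groups are kept. Requiring the neighbour relation to hold in both directions is what distinguishes this from plain kNN: an outlier always has nearest neighbours, but those neighbours rarely point back to it.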

Self-Supervised Generation of Spatial Audio for 360 Video

This work introduces an approach to convert mono audio recorded by a 360° video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere, and shows that it is possible to infer the spatial localization of sounds based only on a synchronized 360° video and the mono audio track.

Dancing with the sound in edge computing environments

A novel "dancing with the sound" task is proposed, which takes sound as an indicator input and outputs a dancing pose sequence. The approach encodes the continuity and rhythm of the sound information into the hidden space to generate a coherent, diverse, rhythmic, and long-term pose video.

Speech2Face: Learning the Face Behind a Voice

This paper designs and trains a deep neural network to perform the task of reconstructing a facial image of a person from a short audio recording of that person speaking, and evaluates and numerically quantifies how these Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.

On Learning Associations of Faces and Voices

It is confirmed that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. The work also computationally models the overlapping information between faces and voices and shows that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans.

Automated Music Generation for Visual Art through Emotion

Two different types of music modelling methods based on RNN and Transformer architectures are explored to build models capable of generating music given an image as input, suggesting that both music generators are able to express music with an emotional connection.

FaceSyncNet: A Deep Learning-Based Approach for Non-Linear Synchronization of Facial Performance Videos

This work leverages large-scale video datasets along with their associated audio tracks and trains a deep learning network to learn the audio descriptors of a given video frame to compute a low-cost non-linear synchronization path.

Audio to Body Dynamics

An LSTM network is trained on violin and piano recital videos uploaded to the Internet, and the predicted points are applied to a rigged avatar to animate it.

Unveiling unexpected training data in internet video

Using clever video curation and processing practices to extract video training signals automatically, this research uncovers new ways to improve the quality of training and reduce the amount of waste.

cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey

The paper presents the futuristic challenges discussed in the cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers from several conferences and journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.

Learning to Localize Sound Source in Visual Scenes

A novel unsupervised algorithm is proposed to address the problem of localizing the sound source in visual scenes, and a two-stream network structure that handles each modality with an attention mechanism is developed for sound source localization.



Semantic Annotation and Retrieval of Music and Sound Effects

We present a computer audition system that can both annotate novel audio tracks with semantically meaningful words and retrieve relevant tracks from a database of unlabeled audio content given a

Harmony in Motion

An approach that acknowledges the importance of temporal features that are based on significant changes in each modality and identifies temporal coincidences between these features, yielding cross-modal association and visual localization is described.

Picasso - to sing, you must close your eyes and draw

A large training set consisting of over 40,000 image/soundtrack samples obtained from 28 movies is created and the suitability of PICASSO is evaluated by means of a user study.

Pixels that sound

This work presents a stable and robust algorithm which grasps dynamic audio-visual events with high spatial resolution, and derives a unique solution based on canonical correlation analysis (CCA), which effectively detects pixels that are associated with the sound, while filtering out other dynamic pixels.

Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects

A novel method is presented that exploits the correlation between the audio-visual dynamics of a video to segment and localize objects that are the dominant source of audio. It is applied to the problem of audio-video synchronization and used to aid interactive segmentation.

The visual microphone

This paper explores how to leverage the rolling shutter in regular consumer cameras to recover audio from standard frame-rate videos, and uses the spatial resolution of the method to visualize how sound-related vibrations vary over an object's surface, which it can use to recover the vibration modes of an object.

Scene Summarization for Online Image Collections

This work proposes a solution to the problem of scene summarization by examining the distribution of images in the collection to select a set of canonical views to form the scene summary, using clustering techniques on visual features.

Multiscale Approaches To Music Audio Feature Learning

Three approaches to multiscale audio feature learning using the spherical K-means algorithm are developed, compared, and evaluated in an automatic tagging task and a similarity metric learning task on the Magnatagatune dataset.

A dataset for Movie Description

Comparing audio descriptions (ADs) to scripts, it is found that ADs are far more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production.

Distilled Collections from Textual Image Queries

We present a distillation algorithm which operates on a large, unstructured, and noisy collection of internet images returned from an online object query. We introduce the notion of a distilled set,