Learning Bimodal Structure in Audio–Visual Data

Gianluca Monaci, Pierre Vandergheynst, and Friedrich T. Sommer. IEEE Transactions on Neural Networks.

A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal…
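The core idea above, representing a signal as a sparse sum of kernels that may be positioned arbitrarily in time, can be sketched with a minimal shift-invariant matching pursuit on a 1-D signal. This is a toy illustration only, not the paper's full bimodal algorithm; the kernel shapes and the synthetic signal are made up for the example:

```python
import numpy as np

def matching_pursuit_shift_invariant(signal, kernels, n_atoms):
    """Greedy shift-invariant matching pursuit: at each step, find the
    kernel and time shift whose inner product with the residual is
    largest, subtract its contribution, and record (kernel, shift, coef)."""
    residual = signal.copy()
    atoms = []
    for _ in range(n_atoms):
        best = None  # (|corr|, coef, kernel index, shift)
        for k_idx, k in enumerate(kernels):
            # inner product of the kernel with the residual at every valid shift
            corr = np.correlate(residual, k, mode="valid")
            shift = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[shift]) > best[0]:
                best = (abs(corr[shift]), corr[shift] / np.dot(k, k), k_idx, shift)
        _, coef, k_idx, shift = best
        k = kernels[k_idx]
        residual[shift:shift + len(k)] -= coef * k
        atoms.append((k_idx, shift, coef))
    return atoms, residual

# Toy signal built from two known kernels placed at different times
k1 = np.sin(np.linspace(0, 4 * np.pi, 32))   # oscillatory kernel
k2 = np.hanning(32)                           # smooth bump kernel
signal = np.zeros(256)
signal[20:52] += 1.5 * k1
signal[100:132] += 0.8 * k2

atoms, residual = matching_pursuit_shift_invariant(signal, [k1, k2], n_atoms=2)
```

Because the two placed atoms do not overlap, two pursuit steps recover both kernel indices, shifts, and amplitudes, leaving a near-zero residual.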

Reverberant speech separation based on audio-visual dictionary learning and binaural cues

This paper presents a novel method for modeling audio-visual (AV) coherence based on dictionary learning: a visual mask is constructed from the video signal using the learnt AV dictionary and combined with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation.
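The fusion of an audio time-frequency mask with a video-derived mask can be illustrated with a toy sketch. The convex-combination rule and the `alpha` weight below are assumptions for illustration only, not the paper's actual fusion rule:

```python
import numpy as np

def fuse_masks(audio_mask, visual_mask, alpha=0.5):
    """Hypothetical fusion rule: convex combination of the audio
    time-frequency mask and a time-aligned visual mask, clipped to [0, 1]."""
    return np.clip(alpha * audio_mask + (1 - alpha) * visual_mask, 0.0, 1.0)

# Toy time-frequency masks: 4 frequency bins x 6 frames
audio_mask = np.array([[1, 1, 0, 0, 1, 1]] * 4, dtype=float)

# Visual voice-activity mask (one value per frame), broadcast across frequency
visual_activity = np.array([1, 1, 1, 0, 0, 1], dtype=float)
visual_mask = np.tile(visual_activity, (4, 1))

fused = fuse_masks(audio_mask, visual_mask)

# Apply the fused mask to a mixture spectrogram (all-ones placeholder here)
mixture = np.ones((4, 6))
separated = fused * mixture
```

Frames where the two masks disagree receive an intermediate weight, so an audio mask corrupted by noise is softened by the (noise-independent) visual evidence.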

Source Separation of Convolutive and Noisy Mixtures Using Audio-Visual Dictionary Learning and Probabilistic Time-Frequency Masking

A new AVDL algorithm is developed which features a bimodality-balanced and scalable matching criterion, a size- and dimension-adaptive dictionary, a fast search index for efficient coding, and cross-modality diverse sparsity.

Audio visual speech source separation via improved context dependent association model

The suggested audio-visual model significantly improves speech classification accuracy compared to the existing GMM-based model, and the proposed AVSS algorithm improves speech separation quality compared to reference ICA- and AVSS-based methods.

Lip movement and speech synchronization detection based on multimodal shift-invariant dictionary

To address the issue that traditional audio-visual speech synchrony analysis models ignore successive and dynamic lip motion information, a novel method based on a shift-invariant learned multimodal dictionary is proposed.

Audio-Visual Localization by Synthetic Acoustic Image Generation

This work proposes to leverage the generation of synthetic acoustic images from common audio-video data for the task of audio-visual localization, using a novel deep architecture trained to reconstruct the ground truth spatialized audio data collected by a microphone array from the associated video and its corresponding monaural audio signal.

Audiovisual Speech Source Separation

The purpose of this article is to provide an overview of the key methodologies in audio-visual speech source separation, building from early methods which simply use the visual modality to identify speech activity through to sophisticated techniques which synthesise a full audio-visual model.

Use of bimodal coherence to resolve the permutation problem in convolutive BSS

Robust front-end for audio, visual and audio–visual speech classification

Experimental results show that good performance is achieved with the proposed system over the three databases and for the different kinds of input information considered, and that the proposed method outperforms other methods reported in the literature on the same two public databases.

Audiovisual Speech Source Separation: An overview of key methodologies

Success in this emerging field will expand the application of voice-based machine interfaces, such as Siri, the intelligent personal assistant on the iPhone and iPad, to much more realistic settings and thereby provide more natural human–machine interfaces.

Audiovisual Gestalts

  • G. Monaci, P. Vandergheynst
  • Computer Science
    2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06)
  • 2006
Experimental results show that extracting significant synchronous audiovisual events can detect the existing cross-modal correlation between those signals even in the presence of distracting motion and acoustic noise, and confirm that temporal proximity between audiovisual events is a key ingredient for the integration of information across modalities.

Audio Vision: Using Audio-Visual Synchrony to Locate Sounds

A system is developed that searches for regions of the visual landscape that correlate highly with the acoustic signal and tags them as likely to contain an acoustic source; results are presented on a speaker localization task.
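The correlation-based localization idea can be sketched as follows: correlate each pixel's intensity time series with the audio energy envelope and flag the highest-correlation pixel. This is a toy illustration with synthetic data; the original system uses richer audio and visual features:

```python
import numpy as np

def localize_by_synchrony(video, audio_energy):
    """Normalized correlation of each pixel's intensity time series with
    the audio energy envelope. video: (T, H, W); audio_energy: (T,).
    Returns an (H, W) correlation map; high values flag likely sources."""
    T, H, W = video.shape
    v = video.reshape(T, -1)
    v = v - v.mean(axis=0)                     # center each pixel's series
    a = audio_energy - audio_energy.mean()     # center the audio envelope
    denom = np.linalg.norm(v, axis=0) * np.linalg.norm(a) + 1e-12
    corr = (v * a[:, None]).sum(axis=0) / denom
    return corr.reshape(H, W)

# Toy scene: one pixel flickers in sync with the audio envelope
rng = np.random.default_rng(1)
T, H, W = 50, 4, 4
audio_energy = rng.random(T)
video = 0.01 * rng.random((T, H, W))           # low-amplitude background noise
video[:, 2, 3] += audio_energy                 # synchronized "mouth" pixel

corr_map = localize_by_synchrony(video, audio_energy)
src = np.unravel_index(np.argmax(corr_map), corr_map.shape)
```

The synchronized pixel dominates the correlation map, so `src` points at its location.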

Video assisted speech source separation

This work uses a statistical model characterizing the nonlinear coherence between audio and visual features as a separation criterion for both instantaneous and convolutive mixtures to optimize the unmixing matrix for speech separation.

Noisy audio feature enhancement using audio-visual speech data

Automatic speech recognition in noisy conditions is improved by enhancing noisy audio features using visual speech captured from the speaker's face; however, this approach fails to capture the full visual-modality benefit to ASR, as demonstrated by comparison with the discriminant audio-visual feature fusion introduced in previous work.

Recent advances in the automatic recognition of audiovisual speech

The main components of audiovisual automatic speech recognition (ASR) are reviewed and novel contributions are presented in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and second, audiovisual speech integration.

Sparse and shift-Invariant representations of music

This article uses a sparse coding formulation within a generative model that explicitly enforces shift-invariance to extract salient structure from musical signals, and demonstrates its potential on two tasks in music analysis.

Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

A novel algorithm is presented that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques, and it is applied to the difficult and realistic case of convolutive mixtures.

Shift-Invariance Sparse Coding for Audio Classification

This paper presents an efficient algorithm for learning SISC bases, and shows that SISC's learned high-level representations of speech and music provide useful features for classification tasks within those domains.