Learning Bimodal Structure in Audio–Visual Data

@article{Monaci2009LearningBS,
  title={Learning Bimodal Structure in Audio–Visual Data},
  author={Gianluca Monaci and Pierre Vandergheynst and Friedrich T. Sommer},
  journal={IEEE Transactions on Neural Networks},
  year={2009},
  volume={20},
  pages={1898-1910}
}
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal… 
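
As a concrete illustration of the decomposition idea (a minimal sketch, not the authors' implementation, and restricted to the audio part of a bimodal kernel for brevity), the Python fragment below runs shift-invariant matching pursuit: each step correlates the residual with every kernel at every temporal offset and subtracts the best-matching scaled, shifted atom.

import numpy as np

def shift_invariant_mp(signal, kernels, n_atoms=10):
    """Greedy matching pursuit over all temporal shifts of each kernel."""
    residual = signal.astype(float)
    code = []  # (kernel index, temporal shift, coefficient) triples
    for _ in range(n_atoms):
        best = None
        for k, phi in enumerate(kernels):
            # corr[t] = <residual[t:t+len(phi)], phi> for every shift t
            corr = np.correlate(residual, phi, mode="valid")
            t = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[t]) > abs(best[2]):
                best = (k, t, corr[t])
        k, t, c = best
        residual[t:t + len(kernels[k])] -= c * kernels[k]  # subtract scaled atom
        code.append((k, t, c))
    return code, residual

# Toy usage: two unit-norm kernels, a signal built from shifted, scaled copies.
kernels = [np.hanning(32) * np.sin(2 * np.pi * f * np.arange(32) / 32)
           for f in (2, 5)]
kernels = [k / np.linalg.norm(k) for k in kernels]
x = np.zeros(256)
x[40:72] += 1.5 * kernels[0]
x[150:182] -= 0.8 * kernels[1]
code, res = shift_invariant_mp(x, kernels, n_atoms=2)
print(code)                 # recovers the (kernel, shift, coefficient) triples
print(np.linalg.norm(res))  # residual energy is close to zero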

Reverberant speech separation based on audio-visual dictionary learning and binaural cues

This paper presents a novel method for modeling audio-visual (AV) coherence based on dictionary learning: a visual mask is constructed from the video signal using the learnt AV dictionary and combined with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation.
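
The mask-fusion step can be sketched in a few lines; the geometric-mean combination rule and the random toy inputs below are assumptions for illustration, not the paper's exact construction.

import numpy as np

def fuse_masks(audio_mask, visual_mask, eps=1e-8):
    """Element-wise geometric mean keeps only bins that both masks support."""
    return np.sqrt((audio_mask + eps) * (visual_mask + eps))

def apply_av_mask(stft_left, stft_right, av_mask):
    """Mask both binaural channels with the fused audio-visual mask."""
    return stft_left * av_mask, stft_right * av_mask

# Toy usage with random spectrograms and masks (freq bins x time frames).
rng = np.random.default_rng(0)
F, T = 257, 100
audio_mask = rng.random((F, T))    # e.g. derived from binaural cues
visual_mask = rng.random((F, T))   # e.g. derived from the learnt AV dictionary
av_mask = fuse_masks(audio_mask, visual_mask)
L = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
R = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
L_hat, R_hat = apply_av_mask(L, R, av_mask)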

Sequential Audio-Visual Correspondence With Alternating Diffusion Kernels

This paper proposes a correspondence measure based on affinity kernels constructed separately in each modality. The measure is motivated both by a kernel density estimation view, in which the signal in one modality is predicted from the other, and by a statistical model which implies that high values of the measure are expected when the signals closely correspond to each other.
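
A minimal sketch of the kernel construction, assuming Gaussian affinities over per-frame feature vectors: each modality yields a row-stochastic diffusion kernel, and their product gives one alternating-diffusion step across modalities. The trace statistic below is an illustrative stand-in, not the paper's exact measure.

import numpy as np

def diffusion_kernel(features, sigma):
    """Row-stochastic Gaussian affinity kernel over time frames."""
    diff = features[:, None, :] - features[None, :, :]
    K = np.exp(-(diff ** 2).sum(axis=-1) / (2.0 * sigma ** 2))
    return K / K.sum(axis=1, keepdims=True)

def alternating_diffusion(audio_feats, video_feats, sigma_a=1.0, sigma_v=1.0):
    Ka = diffusion_kernel(audio_feats, sigma_a)
    Kv = diffusion_kernel(video_feats, sigma_v)
    return Ka @ Kv  # one alternating-diffusion step across the two modalities

# Toy usage: both modalities driven by a shared latent source over 200 frames.
rng = np.random.default_rng(0)
common = rng.standard_normal((200, 3))
audio = common + 0.1 * rng.standard_normal((200, 3))
video = common + 0.1 * rng.standard_normal((200, 3))
K = alternating_diffusion(audio, video)
print(np.trace(K))  # larger when the modalities share common structure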

Lip movement and speech synchronization detection based on multimodal shift-invariant dictionary

To address the fact that traditional audio-visual speech synchrony analysis models ignore successive, dynamic lip-motion information, a novel method based on a shift-invariant learned multimodal dictionary is proposed.

Audio-Visual Localization by Synthetic Acoustic Image Generation

This work proposes to leverage the generation of synthetic acoustic images from common audio-video data for the task of audio-visual localization, using a novel deep architecture trained to reconstruct the ground truth spatialized audio data collected by a microphone array from the associated video and its corresponding monaural audio signal.

Use of bimodal coherence to resolve the permutation problem in convolutive BSS

Robust front-end for audio, visual and audio–visual speech classification

Experimental results show that a good performance is achieved with the proposed system over the three databases and for the different kinds of input information being considered, and the proposed method performs better than other reported methods in the literature over the same two public databases.

Unsupervised Synthetic Acoustic Image Generation for Audio-Visual Scene Understanding

This paper shows that it is still possible to generate acoustic images from off-the-shelf cameras equipped with only a single microphone and how they can be exploited for audio-visual scene understanding.

Example-based cross-modal denoising

This principle is demonstrated by using video to denoise audio with an example-based approach, showing that in cross-modal association a clean signal in one modality (here, video) can direct the estimator and help denoise the other (audio).
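
A toy version of the example-based idea, with assumed nearest-neighbor details: paired (video feature, clean audio frame) examples are stored, and the clean video of a noisy input retrieves the examples whose clean audio frames are averaged into the denoised estimate.

import numpy as np

def cross_modal_denoise(video_query, example_video, example_audio, k=5):
    """Average the clean audio frames of the k nearest video examples."""
    d = np.linalg.norm(example_video - video_query, axis=1)
    nearest = np.argsort(d)[:k]
    return example_audio[nearest].mean(axis=0)

# Toy usage: 500 example pairs, 10-D video features, 128-sample audio frames.
rng = np.random.default_rng(0)
example_video = rng.standard_normal((500, 10))
example_audio = rng.standard_normal((500, 128))
query = example_video[42] + 0.05 * rng.standard_normal(10)  # clean video frame
audio_hat = cross_modal_denoise(query, example_video, example_audio)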

References

Audiovisual Gestalts

  • G. Monaci, P. Vandergheynst
  • 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), 2006
Experimental results show that extracting significant synchronous audiovisual events can detect the cross-modal correlation between those signals even in the presence of distracting motion and acoustic noise, and confirm that temporal proximity between audiovisual events is a key ingredient for the integration of information across modalities.

Noisy audio feature enhancement using audio-visual speech data

Automatic speech recognition in noisy conditions is improved by enhancing noisy audio features using visual speech captured from the speaker's face; however, this approach fails to capture the full benefit of the visual modality for ASR, as demonstrated by comparison with the discriminant audio-visual feature fusion introduced in previous work.

Recent advances in the automatic recognition of audiovisual speech

The main components of audiovisual automatic speech recognition (ASR) are reviewed and novel contributions are presented in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and second, audiovisual speech integration.

Sparse and shift-Invariant representations of music

This article uses a sparse coding formulation within a generative model that explicitly enforces shift-invariance to extract salient structure from musical signals, and demonstrates its potential on two tasks in music analysis.

Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

A novel algorithm is presented that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques, and it is applied to the difficult and realistic case of convolutive mixtures.

Shift-Invariance Sparse Coding for Audio Classification

This paper presents an efficient algorithm for learning shift-invariant sparse coding (SISC) bases, and shows that SISC's learned high-level representations of speech and music provide useful features for classification tasks within those domains.

Multimodal speaker localization in a probabilistic framework

A multimodal probabilistic framework is proposed for the problem of finding the active speaker in a video sequence, and a novel visual feature well suited to analyzing mouth movement is introduced.

Speaker association with signal-level audiovisual fusion

A probabilistic multimodal generation model is introduced and used to derive an information-theoretic measure of cross-modal correspondence, and nonparametric statistical density modeling techniques are shown to characterize the mutual information between signals from different domains.
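
One way to make such a measure concrete is a histogram plug-in estimate of the mutual information between scalar audio and visual features; this estimator is chosen for illustration, while the paper derives its measure from its own generation model.

import numpy as np

def mutual_information(a, v, bins=16):
    """Histogram plug-in estimate of I(A; V) in nats."""
    joint, _, _ = np.histogram2d(a, v, bins=bins)
    p_av = joint / joint.sum()
    p_a = p_av.sum(axis=1, keepdims=True)   # marginal over audio bins
    p_v = p_av.sum(axis=0, keepdims=True)   # marginal over visual bins
    nz = p_av > 0
    return float(np.sum(p_av[nz] * np.log(p_av[nz] / (p_a @ p_v)[nz])))

# Toy usage: audio energy and mouth motion driven by a common source.
rng = np.random.default_rng(0)
s = rng.standard_normal(5000)
audio_energy = s + 0.3 * rng.standard_normal(5000)
mouth_motion = s + 0.3 * rng.standard_normal(5000)
print(mutual_information(audio_energy, mouth_motion))           # high for the speaker
print(mutual_information(audio_energy, rng.standard_normal(5000)))  # near zero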
...