Learning Bimodal Structure in Audio–Visual Data
@article{Monaci2009LearningBS,
  title   = {Learning Bimodal Structure in Audio–Visual Data},
  author  = {Gianluca Monaci and Pierre Vandergheynst and Friedrich T. Sommer},
  journal = {IEEE Transactions on Neural Networks},
  year    = {2009},
  volume  = {20},
  pages   = {1898--1910}
}
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal…
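To make the decomposition concrete, the following is a minimal 1-D sketch of shift-invariant matching pursuit, the greedy decomposition idea underlying the model. It covers audio only and assumes a given dictionary of unit-norm kernels; the paper's kernels are bimodal (synchronous audio and spatio-temporal video parts) and are learned from data, so this is an illustration, not the authors' algorithm.

```python
import numpy as np

# Minimal sketch of shift-invariant matching pursuit on 1-D audio, assuming
# a given dictionary of unit-norm kernels. The paper's kernels are bimodal
# and learned; this shows only the greedy, shift-invariant decomposition.

def matching_pursuit_1d(signal, kernels, n_atoms):
    """Approximate `signal` as a sparse sum of time-shifted kernels."""
    residual = signal.astype(float).copy()
    code = []  # (kernel index, time shift, coefficient) triples
    for _ in range(n_atoms):
        best = None
        for k, kern in enumerate(kernels):
            # Inner product of the kernel with the residual at every shift.
            corr = np.correlate(residual, kern, mode="valid")
            shift = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[shift]) > abs(best[2]):
                best = (k, shift, corr[shift])
        k, shift, coeff = best
        # Subtract the selected atom; optimal because kernels are unit-norm.
        residual[shift:shift + len(kernels[k])] -= coeff * kernels[k]
        code.append((k, shift, coeff))
    return code, residual
```

Because the kernels can be placed at arbitrary shifts, the same dictionary represents events wherever they occur in time, which is what makes the learned structures reusable across signals.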
38 Citations
Reverberant speech separation based on audio-visual dictionary learning and binaural cues
- Physics, Computer Science · 2012 IEEE Statistical Signal Processing Workshop (SSP)
- 2012
This paper presents a novel method for modeling audio-visual (AV) coherence based on dictionary learning: a visual mask is constructed from the video signal using the learnt AV dictionary and combined with the audio mask to obtain a noise-robust audio-visual mask, which is then applied to the binaural signal for source separation.
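As a loose illustration of the fusion step described in this summary, a sketch follows; the convex combination and fixed weight are assumptions, not the paper's fusion rule.

```python
import numpy as np

# Loose illustration of audio-visual mask fusion. The convex combination
# and the fixed weight are assumptions, not the paper's method.

def fuse_av_mask(audio_mask, visual_mask, weight=0.5):
    """Blend audio and visual time-frequency masks (arrays in [0, 1])."""
    return np.clip(weight * audio_mask + (1 - weight) * visual_mask, 0.0, 1.0)

def apply_to_binaural(stft_left, stft_right, av_mask):
    """Apply the fused mask to both channels of the binaural mixture's STFT."""
    return stft_left * av_mask, stft_right * av_mask
```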
Sequential Audio-Visual Correspondence With Alternating Diffusion Kernels
- Computer Science · IEEE Transactions on Signal Processing
- 2018
This paper proposes a correspondence measure based on affinity kernels constructed separately in each modality. The measure is motivated both by a kernel density estimation view, in which the signal in one modality is predicted from the other, and by a statistical model under which high values of the measure are expected when the signals closely correspond.
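A toy sketch of an affinity-kernel correspondence score in this spirit; the Gaussian affinities and the trace-based score are assumptions, not the paper's exact construction.

```python
import numpy as np

# Toy correspondence score built from per-modality Gaussian affinity
# kernels, in the spirit of alternating diffusion. The Gaussian affinities
# and trace-based score are assumptions, not the paper's estimator.

def gaussian_affinity(feats, eps):
    """Row-stochastic Gaussian affinity over time frames; feats is (T, d)."""
    d2 = np.sum((feats[:, None, :] - feats[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / eps)
    return K / K.sum(axis=1, keepdims=True)

def correspondence_score(audio_feats, video_feats, eps=1.0):
    Ka = gaussian_affinity(audio_feats, eps)
    Kv = gaussian_affinity(video_feats, eps)
    M = Ka @ Kv  # diffuse through audio neighborhoods, then video ones
    # Large diagonal mass means both modalities induce similar neighborhood
    # structure over time, i.e., the signals correspond.
    return float(np.trace(M)) / M.shape[0]
```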
Lip movement and speech synchronization detection based on multimodal shift-invariant dictionary
- Computer Science · 2015 IEEE 16th International Conference on Communication Technology (ICCT)
- 2015
To overcome the neglect of successive and dynamic lip-motion information in traditional audio-visual speech synchrony analysis models, a novel method based on shift-invariant learned…
Audio-Visual Localization by Synthetic Acoustic Image Generation
- Physics · AAAI
- 2021
This work proposes to leverage the generation of synthetic acoustic images from common audio-video data for audio-visual localization, using a novel deep architecture trained to reconstruct, from a video and its corresponding monaural audio, the ground-truth spatialized audio collected by a microphone array.
Use of bimodal coherence to resolve the permutation problem in convolutive BSS
- Computer Science · Signal Process.
- 2012
Robust front-end for audio, visual and audio–visual speech classification
- Computer Science · International Journal of Speech Technology
- 2018
Experimental results show that good performance is achieved with the proposed system over the three databases and for the different kinds of input information considered, and that the proposed method outperforms other methods reported in the literature on the same two public databases.
Audio-visual localization with hierarchical topographic maps: Modeling the superior colliculus
- Biology, Psychology · Neurocomputing
- 2012
Unsupervised Synthetic Acoustic Image Generation for Audio-Visual Scene Understanding
- Physics, Computer Science · IEEE Transactions on Image Processing
- 2022
This paper shows that acoustic images can still be generated from off-the-shelf cameras equipped with only a single microphone, and how such images can be exploited for audio-visual scene understanding.
Example-based cross-modal denoising
- Computer Science · 2012 IEEE Conference on Computer Vision and Pattern Recognition
- 2012
This principle is demonstrated with an example-based approach that uses video to denoise audio, showing that a clear video can guide the audio estimator through cross-modal association and thus help denoise the other modality.
References
Showing 1–10 of 72 references.
Audiovisual Gestalts
- Computer Science · 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06)
- 2006
Experimental results show that extracting significant synchronous audiovisual events can detect the cross-modal correlation between these signals even in the presence of distracting motion and acoustic noise, and confirm that temporal proximity between audiovisual events is a key ingredient for the integration of information across modalities.
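As a rough illustration of synchrony-based audiovisual association (not the paper's method), one can correlate the audio energy envelope with per-pixel motion energy to find image regions whose activity co-varies with the soundtrack:

```python
import numpy as np

# Rough illustration of synchrony-based audiovisual association (not the
# paper's method): Pearson correlation between the audio energy envelope
# and per-pixel motion energy, at zero lag.

def synchrony_map(audio_env, motion_energy):
    """audio_env: (T,); motion_energy: (T, H, W). Returns (H, W) correlations."""
    a = (audio_env - audio_env.mean()) / (audio_env.std() + 1e-8)
    m = motion_energy - motion_energy.mean(axis=0)
    m = m / (m.std(axis=0) + 1e-8)
    return np.einsum("t,thw->hw", a, m) / len(a)
```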
Analysis of multimodal sequences using geometric video representations
- Computer Science · Signal Process.
- 2006
Noisy audio feature enhancement using audio-visual speech data
- Computer Science · 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing
- 2002
Automatic speech recognition in noisy conditions is improved by enhancing the noisy audio features using visual speech captured from the speaker's face, although this approach fails to capture the full benefit of the visual modality for ASR, as demonstrated by comparison with the discriminant audio-visual feature fusion introduced in previous work.
Visual voice activity detection as a help for speech source separation from convolutive mixtures
- Computer Science · Speech Commun.
- 2007
Recent advances in the automatic recognition of audiovisual speech
- Computer Science · Proc. IEEE
- 2003
The main components of audiovisual automatic speech recognition (ASR) are reviewed and novel contributions in two main areas are presented: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration.
Sparse and shift-Invariant representations of music
- Computer Science · IEEE Transactions on Audio, Speech, and Language Processing
- 2006
This article uses a sparse coding formulation within a generative model that explicitly enforces shift-invariance to extract salient structure from musical signals, and demonstrates its potential on two tasks in music analysis.
Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures
- Engineering · IEEE Transactions on Audio, Speech, and Language Processing
- 2007
A novel algorithm is presented that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques, and it is applied to the difficult and realistic case of convolutive mixtures.
Shift-Invariant Sparse Coding for Audio Classification
- Computer Science · UAI
- 2007
This paper presents an efficient algorithm for learning SISC bases, and shows that SISC's learned high-level representations of speech and music provide useful features for classification tasks within those domains.
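For illustration, here is one hypothetical dictionary-update step for shift-invariant sparse coding, assuming the sparse code (kernel, shift, coefficient triples) is held fixed; the paper's learning algorithm is more efficient, and this only shows the gradient direction on the reconstruction error.

```python
import numpy as np

# Hypothetical single dictionary-update step for shift-invariant sparse
# coding with the sparse code held fixed. Not the paper's algorithm; it
# illustrates the gradient of the reconstruction error w.r.t. the kernels.

def update_kernels(signal, kernels, code, lr=0.01):
    # Reconstruct the signal from the current code.
    recon = np.zeros_like(signal)
    for k, shift, coeff in code:
        recon[shift:shift + len(kernels[k])] += coeff * kernels[k]
    residual = signal - recon
    # Accumulate, per kernel, the coefficient-weighted residual under each
    # of its occurrences: the negative gradient of 0.5 * ||residual||^2.
    grads = [np.zeros_like(kern) for kern in kernels]
    for k, shift, coeff in code:
        grads[k] += coeff * residual[shift:shift + len(kernels[k])]
    for k, kern in enumerate(kernels):
        kern += lr * grads[k]
        kern /= np.linalg.norm(kern)  # keep kernels unit-norm
    return kernels
```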
Multimodal speaker localization in a probabilistic framework
- Computer Science · 2006 14th European Signal Processing Conference
- 2006
A multimodal probabilistic framework is proposed for finding the active speaker in a video sequence, along with a novel visual feature well suited to analyzing the movement of the mouth.
Speaker association with signal-level audiovisual fusion
- Computer Science · IEEE Transactions on Multimedia
- 2004
A probabilistic multimodal generation model is introduced and used to derive an information-theoretic measure of cross-modal correspondence, and nonparametric statistical density modeling techniques are shown to characterize the mutual information between signals from different domains.
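A toy version of such a nonparametric mutual information estimate for paired scalar features follows; the feature choice and this particular resubstitution estimator are assumptions, and the paper's density modeling is more general.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy resubstitution estimate of the mutual information between paired
# scalar audio and visual features via Gaussian kernel density estimates.
# The feature choice and this estimator are assumptions for illustration.

def mutual_information_kde(a, v):
    """a, v: 1-D arrays of paired features. Returns an MI estimate in nats."""
    joint = gaussian_kde(np.vstack([a, v]))    # estimate of p(a, v)
    pa, pv = gaussian_kde(a), gaussian_kde(v)  # marginals p(a), p(v)
    log_ratio = joint.logpdf(np.vstack([a, v])) - pa.logpdf(a) - pv.logpdf(v)
    return float(np.mean(log_ratio))           # E[log p(a,v) / (p(a) p(v))]
```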