Bio-Inspired Modality Fusion for Active Speaker Detection

Gustavo Assunção, Nuno Gonçalves and Paulo Menezes
Human beings have developed remarkable abilities to integrate information from various sensory sources by exploiting their inherent complementarity. Perceptual capabilities are thereby heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation…

Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially Aware Language Acquisition
An extensive evaluation of the proposed self-supervised method for visual detection of the active speaker in multi-person spoken interaction, carried out on a large multi-person face-to-face interaction data set, concludes that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
Look, Listen and Learn - A Multimodal LSTM for Speaker Identification
A novel multimodal Long Short-Term Memory (LSTM) architecture is described that seamlessly unifies the visual and auditory modalities from the beginning of each sequence input, outperforming state-of-the-art systems in speaker identification with a lower false-alarm rate and higher recognition accuracy.
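The paper's exact architecture is not reproduced here, but the early-fusion idea it describes — concatenating audio and visual feature vectors at every timestep before a single recurrent cell — can be sketched as follows. This is a minimal, illustrative numpy implementation; the class name, dimensions, and single-cell design are assumptions for the sketch, not the authors' model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class EarlyFusionLSTM:
    """Minimal single-cell LSTM over concatenated audio+visual features
    (illustrative sketch of early fusion, not the paper's architecture)."""

    def __init__(self, audio_dim, visual_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = audio_dim + visual_dim  # early fusion: concatenate modalities
        # one weight matrix per gate: input, forget, output, candidate
        self.W = rng.standard_normal((4, hidden_dim, in_dim + hidden_dim)) * 0.1
        self.b = np.zeros((4, hidden_dim))
        self.hidden_dim = hidden_dim

    def forward(self, audio_seq, visual_seq):
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        for a_t, v_t in zip(audio_seq, visual_seq):
            # both modalities enter the cell jointly from the first timestep
            x = np.concatenate([a_t, v_t, h])
            i = sigmoid(self.W[0] @ x + self.b[0])   # input gate
            f = sigmoid(self.W[1] @ x + self.b[1])   # forget gate
            o = sigmoid(self.W[2] @ x + self.b[2])   # output gate
            g = np.tanh(self.W[3] @ x + self.b[3])   # candidate state
            c = f * c + i * g
            h = o * np.tanh(c)
        return h  # final state, e.g. fed to a speaker-ID classifier head
```

In practice the final hidden state would feed a softmax over speaker identities; the point of the sketch is only that fusion happens at the input, not after per-modality encoders.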
Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs
A mid-fusion approach is proposed to perform both VAD and SSL for multiple active and inactive speakers by analyzing individual speakers' spatio-temporal activities and mouth movements, using Support Vector Machines and Hidden Markov Models to assess the video and audio modalities captured by an RGB camera and a microphone array.
Looking to listen at the cocktail party
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
Cross-Modal Supervision for Learning Active Speaker Detection in Video
This work is the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset, and is seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision.
Active-speaker detection and localization with microphones and cameras embedded into a robotic head
A method for detecting and localizing an active speaker through the fusion between visual reconstruction with a stereoscopic camera pair and sound-source localization with several microphones is presented, which enables natural human-robot interactive behavior.
Vision-based Active Speaker Detection in Multiparty Interaction
The main goal of the presented method is to provide real-time detection of the active speaker within a broader framework implemented on a robot and used to generate natural focus of visual attention behavior during multiparty human-robot interactions.
Exploring Co-Occurrence Between Speech and Body Movement for Audio-Guided Video Localization
This paper presents a bottom-up approach that combines audio and video to simultaneously locate individual speakers in the video (2D source localization) and segment their speech (speaker diarization).
Look who's talking: visual identification of the active speaker in multi-party human-robot interaction
A data-driven methodology for automatic visual identification of the active speaker based on facial action units (AUs) is presented; it will be implemented on a robot and used to generate natural focus of visual attention behavior during multi-party human-robot interactions.
Audio-visual speaker localization via weighted clustering
This paper proposes a novel weighted clustering method based on a finite mixture model, which explores the idea of non-uniform weighting of observations, introduces a weighted-data mixture model, and formally derives the associated EM procedure.
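The non-uniform weighting of observations can be sketched with a small EM loop for a Gaussian mixture in which each observation carries a weight that scales its contribution to the M-step sufficient statistics. Note this importance-weighting scheme is a simplified stand-in, not the paper's precision-scaling weighted-data formulation; the function name, isotropic covariances, and `init_mu` parameter are assumptions for the sketch.

```python
import numpy as np

def weighted_em_gmm(X, w, K=2, iters=50, init_mu=None):
    """EM for an isotropic Gaussian mixture with per-observation weights w.
    Illustrative sketch: weights importance-scale the M-step statistics."""
    N, D = X.shape
    mu = np.array(init_mu, float) if init_mu is not None else X[:K].copy()
    var = np.full(K, X.var())        # isotropic variance per component
    pi = np.full(K, 1.0 / K)         # mixing proportions
    for _ in range(iters):
        # E-step: responsibilities (log-domain for numerical stability)
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)            # (N, K)
        logp = np.log(pi) - 0.5 * D * np.log(2 * np.pi * var) - d2 / (2 * var)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each observation's contribution is scaled by its weight
        rw = r * w[:, None]
        Nk = rw.sum(0) + 1e-12
        pi = Nk / Nk.sum()
        mu = (rw[:, :, None] * X[:, None, :]).sum(0) / Nk[:, None]
        d2_new = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (rw * d2_new).sum(0) / (D * Nk) + 1e-6
    return pi, mu, var
```

With uniform weights this reduces to ordinary EM; down-weighting unreliable audio-visual observations (the paper's motivation) simply shrinks their pull on the cluster means.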