Bio-Inspired Modality Fusion for Active Speaker Detection

@article{Assuno2021BioInspiredMF,
  title={Bio-Inspired Modality Fusion for Active Speaker Detection},
  author={Gustavo Assun{\c{c}}{\~a}o and Nuno Gon{\c{c}}alves and Paulo Menezes},
  journal={ArXiv},
  year={2021},
  volume={abs/2003.00063}
}
Human beings have developed remarkable abilities to integrate information from various sensory sources by exploiting their inherent complementarity. Perceptual capabilities are thereby heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation…
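As a rough illustration of the kind of audio-visual fusion for active speaker detection the abstract refers to (and not the architecture proposed in the paper), the sketch below concatenates per-frame audio and face-crop embeddings and scores each candidate face as speaking or silent. All module names, feature dimensions, and the late-fusion choice are assumptions made for brevity.

```python
# Minimal late-fusion sketch for active speaker detection (illustrative only;
# NOT the paper's architecture). Dimensions and sub-networks are assumptions.
import torch
import torch.nn as nn

class LateFusionASD(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=256, hidden=128):
        super().__init__()
        # Per-modality encoders map raw features (e.g., log-mel frames,
        # face-crop CNN features) to fixed-size embeddings.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        # Fusion head: concatenate embeddings and predict speaking / not speaking.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, audio_feat, visual_feat):
        # audio_feat:  (batch, audio_dim)   per-frame audio features
        # visual_feat: (batch, visual_dim)  per-frame features of one candidate face
        fused = torch.cat(
            [self.audio_enc(audio_feat), self.visual_enc(visual_feat)], dim=-1
        )
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # P(face is speaking)

model = LateFusionASD()
p = model(torch.randn(4, 128), torch.randn(4, 256))  # toy batch of 4 frames
print(p.shape)  # torch.Size([4])
```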


References

Showing 1-10 of 39 references
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially Aware Language Acquisition
TLDR
An extensive evaluation of the proposed self-supervised method for visual detection of the active speaker in multi-person spoken interaction, carried out on a large face-to-face interaction data set, concludes that the method is an essential component of any artificial cognitive system or robotic platform engaging in social interaction.
Look, Listen and Learn - A Multimodal LSTM for Speaker Identification
TLDR
Describes a novel multimodal Long Short-Term Memory (LSTM) architecture that seamlessly unifies the visual and auditory modalities from the beginning of each sequence input, and outperforms state-of-the-art speaker identification systems with a lower false alarm rate and higher recognition accuracy.
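A minimal sketch of the idea described in this entry (concatenating audio and visual features at every time step before a shared LSTM) is given below; the feature dimensions and the single-layer configuration are assumptions, not the cited paper's settings.

```python
# Sketch of early, per-time-step fusion inside a shared LSTM; dimensions and
# the single-layer setup are assumptions, not the cited paper's configuration.
import torch
import torch.nn as nn

class MultimodalLSTM(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, hidden=256, n_speakers=10):
        super().__init__()
        # Both modalities are concatenated at each time step, so the LSTM sees
        # a joint audio-visual observation from the very first frame.
        self.lstm = nn.LSTM(audio_dim + visual_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_speakers)

    def forward(self, audio_seq, visual_seq):
        # audio_seq: (batch, T, audio_dim), visual_seq: (batch, T, visual_dim)
        joint = torch.cat([audio_seq, visual_seq], dim=-1)
        _, (h_last, _) = self.lstm(joint)
        return self.classifier(h_last[-1])  # speaker-identity logits

logits = MultimodalLSTM()(torch.randn(2, 30, 40), torch.randn(2, 30, 512))
print(logits.shape)  # torch.Size([2, 10])
```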
Simultaneous-Speaker Voice Activity Detection and Localization Using Mid-Fusion of SVM and HMMs
TLDR
A mid-fusion approach for joint voice activity detection (VAD) and sound source localization (SSL) with multiple active and inactive speakers, which analyzes each speaker's spatio-temporal activity and mouth movements using Support Vector Machines and Hidden Markov Models to assess the video and audio modalities captured by an RGB camera and a microphone array.
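The entry above combines a visual classifier and an audio model at the score level. A toy sketch of such score-level ("mid") fusion follows; the weighting scheme, threshold, and example probabilities are illustrative assumptions, not the cited method.

```python
# Toy score-level ("mid") fusion: each modality yields a per-speaker speaking
# probability, and a weighted log-linear combination flags the active speakers.
# Weights and threshold are illustrative assumptions.
import numpy as np

def mid_fusion(p_visual, p_audio, w_visual=0.6, w_audio=0.4, threshold=0.5):
    """p_visual, p_audio: arrays of shape (n_speakers,) with per-modality
    speaking probabilities. Returns fused scores and a boolean mask of
    speakers declared active (several may speak simultaneously)."""
    eps = 1e-9
    log_fused = w_visual * np.log(p_visual + eps) + w_audio * np.log(p_audio + eps)
    fused = np.exp(log_fused)
    return fused, fused > threshold

p_vis = np.array([0.9, 0.2, 0.7])   # e.g., from a mouth-movement classifier
p_aud = np.array([0.8, 0.1, 0.6])   # e.g., from per-direction voice activity
fused, active = mid_fusion(p_vis, p_aud)
print(fused.round(2), active)
```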
Looking to listen at the cocktail party
TLDR
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
Cross-Modal Supervision for Learning Active Speaker Detection in Video
TLDR
This work is the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset, and is seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision.
Active-speaker detection and localization with microphones and cameras embedded into a robotic head
TLDR
A method for detecting and localizing an active speaker by fusing visual reconstruction from a stereoscopic camera pair with sound-source localization from several microphones, enabling natural human-robot interactive behavior.
Vision-based Active Speaker Detection in Multiparty Interaction
TLDR
The main goal of the presented method is to provide real-time detection of the active speaker within a broader framework implemented on a robot and used to generate natural focus of visual attention behavior during multiparty human-robot interactions.
Exploring Co-Occurence Between Speech and Body Movement for Audio-Guided Video Localization
This paper presents a bottom-up approach that combines audio and video to simultaneously locate individual speakers in the video (2D source localization) and segment their speech (speaker diarization).
Look who's talking: visual identification of the active speaker in multi-party human-robot interaction
TLDR
A data-driven methodology for automatic visual identification of the active speaker based on facial action units (AUs), to be implemented on a robot and used to generate natural focus-of-visual-attention behavior during multi-party human-robot interactions.
Audio-visual speaker localization via weighted clustering
TLDR
This paper proposes a novel weighted clustering method based on a finite mixture model that explores non-uniform weighting of observations; it introduces a weighted-data mixture model and formally devises the associated EM procedure.
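A compact sketch of EM for a weighted-data Gaussian mixture (1-D, for brevity) is given below to illustrate the per-observation weighting idea; it is a didactic simplification, not the cited paper's full audio-visual model.

```python
# EM for a 1-D Gaussian mixture where every observation carries a weight w_i
# (e.g., a confidence from an audio or visual front-end). Weights simply scale
# each point's contribution to the sufficient statistics. Didactic sketch only.
import numpy as np

def weighted_gmm_em(x, w, k=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k)            # initial means
    var = np.full(k, np.var(x))      # initial variances
    pi = np.full(k, 1.0 / k)         # mixture weights
    for _ in range(iters):
        # E-step: responsibilities r[i, j] proportional to pi_j * N(x_i; mu_j, var_j)
        diff = x[:, None] - mu[None, :]
        log_pdf = -0.5 * (diff**2 / var + np.log(2 * np.pi * var))
        r = pi * np.exp(log_pdf)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: sufficient statistics scaled by the observation weights w_i.
        wr = w[:, None] * r
        n_j = wr.sum(axis=0)
        mu = (wr * x[:, None]).sum(axis=0) / n_j
        var = (wr * (x[:, None] - mu) ** 2).sum(axis=0) / n_j
        pi = n_j / w.sum()
    return mu, var, pi

x = np.concatenate([np.random.default_rng(1).normal(-2, 0.5, 100),
                    np.random.default_rng(2).normal(3, 0.8, 100)])
w = np.ones_like(x)                  # uniform weights reduce to standard EM
print(weighted_gmm_em(x, w))
```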