Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers

@article{Ban2021VariationalBI,
  title={Variational Bayesian Inference for Audio-Visual Tracking of Multiple Speakers},
  author={Yutong Ban and Xavier Alameda-Pineda and Laurent Girin and Radu Horaud},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2021},
  volume={43},
  pages={1761--1776}
}
In this article, we address the problem of tracking multiple speakers via the fusion of visual and auditory information. We propose to exploit the complementary nature and roles of these two modalities in order to accurately estimate smooth trajectories of the tracked persons, to deal with the partial or total absence of one of the modalities over short periods of time, and to estimate the acoustic status, either speaking or silent, of each tracked person over time. We propose to cast the problem…
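The abstract is truncated above. To fix ideas, here is a minimal, hypothetical sketch (in Python/NumPy, not the authors' code) of the kind of variational E-step such a tracker runs at each frame: observations are softly assigned to speakers, and each speaker's Gaussian posterior over position is updated by fusing the dynamical prediction with the assigned observations. All names are illustrative, the observation model is a single shared Gaussian, and the sketch omits the paper's audio-visual fusion details, the source birth process, and the speaking/silent status variables.

import numpy as np

def variational_tracking_step(mu_prev, Sigma_prev, obs,
                              Sigma_dyn, Sigma_obs, n_iters=5):
    """One variational filtering step for N speakers in d dimensions.

    mu_prev: (N, d) posterior means at the previous frame.
    Sigma_prev: (N, d, d) posterior covariances at the previous frame.
    obs: (M, d) localized observations (audio and/or visual) at this frame.
    Sigma_dyn, Sigma_obs: (d, d) dynamics and observation noise covariances.
    """
    N = mu_prev.shape[0]
    # Prediction under a random-walk dynamical model.
    mu_pred = mu_prev.copy()
    Sigma_pred = Sigma_prev + Sigma_dyn
    mu, Sigma = mu_pred.copy(), Sigma_pred.copy()
    obs_prec = np.linalg.inv(Sigma_obs)
    for _ in range(n_iters):
        # E-step on assignments: responsibility of each speaker for each observation.
        diff = obs[:, None, :] - mu[None, :, :]               # (M, N, d)
        logits = -0.5 * np.einsum('mnd,de,mne->mn', diff, obs_prec, diff)
        resp = np.exp(logits - logits.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)               # (M, N)
        # E-step on positions: precision-weighted fusion of the dynamical
        # prediction with the softly assigned observations.
        for n in range(N):
            w = resp[:, n].sum()
            pred_prec = np.linalg.inv(Sigma_pred[n])
            Sigma[n] = np.linalg.inv(pred_prec + w * obs_prec)
            mu[n] = Sigma[n] @ (pred_prec @ mu_pred[n]
                                + obs_prec @ (resp[:, n] @ obs))
    return mu, Sigma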
Tracking Multiple Audio Sources With the von Mises Distribution and Variational EM
TLDR: This letter proposes an audio-source birth method that favors smooth source trajectories, which is used both to initialize the number of active sources and to detect new sources, and infers a variational approximation of the filtering distribution. (A note on the von Mises density appears after this list.)
Audio-visual tracking of concurrent speakers
TLDR: Experiments show that the proposed tracker outperforms the uni-modal trackers and the state-of-the-art approaches both in 3D and on the image plane.
Advances in Online Audio-Visual Meeting Transcription
TLDR: A system that generates speaker-annotated transcripts of meetings using a microphone array and a 360-degree camera is described, together with an online audio-visual speaker diarization method that leverages face tracking and identification, sound source localization, speaker identification and, if available, prior speaker information for robustness to various real-world challenges.
Audio-Visual Variational Fusion for Multi-Person Tracking with Robots
TLDR: This demo presents the authors' now-mature achievements in the field, demonstrating two robotic systems able to track multiple persons using auditory and visual cues whenever they are available.
Self-Supervised Moving Vehicle Tracking With Stereo Sound
TLDR: This work proposes a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time, and demonstrates that the proposed approach outperforms several baseline approaches.
Multi-Target DoA Estimation with an Audio-Visual Fusion Mechanism
TLDR: This work proposes a novel video simulation method that generates visual features from noisy target 3D annotations synchronized with acoustic features, and confirms that audio-visual fusion consistently improves the performance of speaker DoA estimation, while the adaptive weighting mechanism shows clear benefits.
Audio-Visual Multi-Speaker Tracking Based on the GLMB Framework
TLDR: A Generalized Labelled Multi-Bernoulli (GLMB)-based framework that jointly estimates the number of targets and their respective states online; experimental results demonstrate the effectiveness of the proposed method.
2020-02718 - PhD Position F/M: Deep Probabilistic Reinforcement Learning for Audio-Visual Human-Robot Interaction (2020)
Reinforcement learning, and in particular deep reinforcement learning (DRL), has become very popular in recent years, successfully addressing a wide variety of tasks such as board-game playing. It has also…
Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling
TLDR: A self-supervised training method using 360° images and multichannel audio signals that trains deep neural networks to distinguish multiple sound-source objects; experiments demonstrated that the DNNs trained by the method localized multiple speakers.
3D Audiovisual Speaker Tracking with Distributed Sensors Configuration
TLDR: Azimuth and elevation are estimated from audio information and fused with the position estimate obtained from a Viola-Jones-based observation model, reducing the distance-estimation uncertainty of the video model and improving tracking accuracy.
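A brief note on the von Mises model used in the first citing work above ("Tracking Multiple Audio Sources With the von Mises Distribution and Variational EM", also listed in the references below): the von Mises distribution is the standard circular analogue of the Gaussian for angular direction-of-arrival observations. Its density over an angle y with mean direction \theta and concentration \kappa is

p(y \mid \theta, \kappa) = \frac{\exp\{\kappa \cos(y - \theta)\}}{2\pi\, I_0(\kappa)},

where I_0 is the modified Bessel function of the first kind and order zero; larger \kappa concentrates the density around \theta, and \kappa \to 0 recovers the uniform distribution on the circle.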

References

Showing 1-10 of 46 references
Tracking Multiple Audio Sources With the von Mises Distribution and Variational EM
TLDR: This letter proposes an audio-source birth method that favors smooth source trajectories, which is used both to initialize the number of active sources and to detect new sources, and infers a variational approximation of the filtering distribution.
Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings
TLDR: Results show that the framework is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy, can deal with cases of visual clutter and occlusion, and significantly outperforms a traditional sampling-based approach.
Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments
TLDR: The direct-path relative transfer function (DP-RTF), an interchannel feature that encodes acoustic information robust against reverberation, is used, and an online algorithm well suited for estimating DP-RTFs associated with moving audio sources is proposed.
Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
TLDR: The proposed audio-visual spatiotemporal diarization model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones.
Mean-Shift and Sparse Sampling-Based SMC-PHD Filtering for Audio Informed Visual Speaker Tracking
TLDR: The audio data is used to improve the visual SMC-PHD (V-SMC-PHD) filter by using the direction-of-arrival angles of the audio sources to determine when to propagate the born particles and reallocate the surviving and spawned particles.
Exploiting the Complementarity of Audio and Visual Data in Multi-speaker Tracking
TLDR: This paper proposes a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces, and that is robust to missing data and therefore able to track even when observations from one of the modalities are absent.
An on-line variational Bayesian model for multi-person tracking from cluttered scenes
TLDR: An on-line variational Bayesian model for multi-person tracking from cluttered visual observations provided by person detectors is proposed; it shows competitive results with respect to state-of-the-art multiple-object tracking algorithms such as the probability hypothesis density (PHD) filter, among others.
Audio Assisted Robust Visual Tracking With Adaptive Particle Filtering
TLDR: An algorithm is designed that adapts both the number of particles and the noise variance based on the tracking error and the area occupied by the particles in the image, thereby solving a typical problem associated with the particle filter (PF).
Multimodal Speaker Diarization
We present a novel probabilistic framework that fuses information coming from the audio and video modalities to perform speaker diarization. The proposed framework is a Dynamic Bayesian Network (DBN)…
Multiple-Speaker Localization Based on Direct-Path Features and Likelihood Maximization With Spatial Sparsity Regularization
TLDR: This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments using binaural recordings of an acoustic scene and a complex-valued Gaussian mixture model, and extends DP-RTF estimation to the case of multiple sources.