Audio-Visual SLAM towards Human Tracking and Human-Robot Interaction in Indoor Environments

  • Aaron D. Chau, Kouhei Sekiguchi, Aditya Arie Nugraha, Kazuyoshi Yoshii, Kotaro Funakoshi
  • 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
We propose a novel audio-visual simultaneous localization and mapping (SLAM) framework that exploits the pose and speech of human sound sources, allowing a robot equipped with a microphone array and a monocular camera to track, map, and interact with human partners in an indoor environment. Since human interaction is characterized by features perceived not only in the visual modality but in the acoustic modality as well, SLAM systems must utilize information from both. Using a state…
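As a rough illustration of the fusion idea in the abstract, the sketch below combines a visual bearing to a person (e.g. derived from 2D pose keypoints) with an acoustic direction-of-arrival estimate by inverse-variance averaging on the unit circle. The function name and variance parameters are hypothetical stand-ins, not the paper's actual observation model.

```python
import math

def fuse_bearings(theta_v, var_v, theta_a, var_a):
    """Inverse-variance fusion of a visual bearing and an acoustic DOA
    estimate (both in radians). Angles are averaged as unit vectors so
    the result handles wrap-around at +/- pi correctly."""
    w_v, w_a = 1.0 / var_v, 1.0 / var_a
    x = w_v * math.cos(theta_v) + w_a * math.cos(theta_a)
    y = w_v * math.sin(theta_v) + w_a * math.sin(theta_a)
    fused = math.atan2(y, x)
    fused_var = 1.0 / (w_v + w_a)  # standard inverse-variance result
    return fused, fused_var
```

With equal variances the fused bearing sits midway between the two measurements, and the fused variance is halved, matching the usual scalar Kalman-style update.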

3D Localization of a Sound Source Using Mobile Microphone Arrays Referenced by SLAM
The approach explored in this paper consists of two robots, each equipped with a microphone array, localizing themselves in a shared reference map using SLAM; data from the two microphone arrays are then used to triangulate the 3D location of a sound source relative to the same map.
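The triangulation step described above can be sketched as finding the midpoint of the closest points between two bearing rays, one per robot; in practice the rays rarely intersect exactly in 3D. `triangulate` and its arguments are illustrative names, not the paper's implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate(p1, d1, p2, d2):
    """Midpoint of the closest points between rays p1 + t*d1 and
    p2 + s*d2, where p1, p2 are robot positions and d1, d2 are
    bearing directions toward the sound source."""
    w0 = tuple(a - b for a, b in zip(p1, p2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b  # zero only when the bearings are parallel
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    q1 = tuple(p + t * u for p, u in zip(p1, d1))
    q2 = tuple(p + s * u for p, u in zip(p2, d2))
    return tuple((x + y) / 2 for x, y in zip(q1, q2))
```

When the two bearings do intersect, the midpoint is exactly the intersection point; otherwise it splits the residual error between the two rays.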
Object Permanence Through Audio-Visual Representations
A multimodal neural network model is developed that predicts the full bounce trajectory and final resting location of a dropped object; it retrieved dropped objects with minimal vision-based pick-up adjustments and outperformed vision-only and audio-only baselines.


Acoustic SLAM
  • C. Evers, P. Naylor
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2018
This paper proposes Acoustic Simultaneous Localization and Mapping (aSLAM), which uses acoustic signals to simultaneously map the 3D positions of multiple sound sources while passively localizing the observer within the scene map.
Optimized Self-Localization for SLAM in Dynamic Scenes Using Probability Hypothesis Density Filters
The proposed approach probabilistically anchors the observer state by fusing observer information inferred from the scene with reports of the observer motion, and generalizes existing Probability Hypothesis Density (PHD)-based SLAM algorithms.
Real-time super-resolution Sound Source Localization for robots
This work proposes two methods, MUSIC based on Generalized Singular Value Decomposition (GSVD-MUSIC) and Hierarchical SSL (H-SSL), which drastically reduce the computational cost while maintaining noise robustness in localization.
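As an illustration of the coarse-to-fine idea behind hierarchical SSL (a sketch of the general strategy, not the paper's algorithm), the snippet below refines a 1-D steering scan around the best coarse cell, cutting the number of pseudospectrum evaluations compared with a single fine grid. The function and parameter names are hypothetical.

```python
def hierarchical_search(power, lo, hi, coarse=12, levels=3):
    """Coarse-to-fine 1-D scan: evaluate the spatial power/pseudospectrum
    `power` on a coarse grid over [lo, hi], then repeatedly zoom in on
    the interval around the best cell."""
    for _ in range(levels):
        step = (hi - lo) / coarse
        grid = [lo + i * step for i in range(coarse + 1)]
        best = max(grid, key=power)
        lo, hi = best - step, best + step
    return best
```

Three levels of a 13-point scan cost 39 evaluations but achieve roughly the resolution of a single grid with thousands of points, which is the source of the speed-up this class of methods targets.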
Monocular vision-based human following on miniature robotic blimp
An approach that allows the Georgia Tech Miniature Autonomous Blimp (GT-MAB) to detect and follow a human is presented; this is the first Human-Robot Interaction (HRI) demonstration between an uninstrumented human and a robotic blimp.
OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields
OpenPose is released, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints, and the first combined body and foot keypoint detector, based on an internal annotated foot dataset.
A Random-Finite-Set Approach to Bayesian SLAM
Simulated and experimental results demonstrate the merits of the proposed approach, particularly in situations of high clutter and data association ambiguity.
Simultaneous localization and mapping: part I
This paper describes the simultaneous localization and mapping (SLAM) problem and the essential methods for solving it, and summarizes key implementations and demonstrations of the method.
Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition
A weighting process adaptive to various background-noise situations is developed following a Separate Integration (SI) architecture; a mapping between the noise measurements and the free parameter of the fusion process is derived and its applicability demonstrated.
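A minimal sketch of SNR-driven stream weighting in the Separate Integration spirit: the audio and visual stream scores are combined with a weight lambda that grows with the signal-to-noise ratio. The linear SNR-to-weight mapping and all names and thresholds here are hypothetical stand-ins for the learned mapping in the paper.

```python
def combined_score(logp_audio, logp_visual, snr_db,
                   snr_lo=-5.0, snr_hi=20.0):
    """Weight the audio-stream log-score by lambda in [0, 1], driven by
    the estimated SNR in dB: clean audio (high SNR) dominates, noisy
    audio defers to the visual stream."""
    lam = min(1.0, max(0.0, (snr_db - snr_lo) / (snr_hi - snr_lo)))
    return lam * logp_audio + (1.0 - lam) * logp_visual
```

At the extremes the fusion degenerates to a single stream (lambda of 0 or 1), which is the behavior a noise-adaptive recognizer wants when one modality is unreliable.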
Neural network based spectral mask estimation for acoustic beamforming
A neural network based approach to acoustic beamforming is presented, used to estimate spectral masks from which the Cross-Power Spectral Density matrices of speech and noise are estimated, which are used to compute the beamformer coefficients.
Consistency of the EKF-SLAM Algorithm
It is shown that the algorithm produces very optimistic estimates once the "true" uncertainty in vehicle heading exceeds a limit, and the manageable degradation of small heading variance SLAM indicates the efficacy of submap methods for large-scale maps.