• Corpus ID: 239768959

A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments

  Petros Giannakopoulos, Aggelos Pikrakis, Yannis Cotronis
In this work we apply deep reinforcement learning to the problems of navigating a three-dimensional environment and inferring the locations of human speaker audio sources within it, in the case where the only available information is the raw sound from the environment, as a simulated human listener placed in the environment would hear it. For this purpose we create two virtual environments using the Unity game engine, one presenting an audio-based navigation problem and one presenting an audio…
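The abstract above comes with no code; purely as an illustrative sketch of the interface it describes (an agent whose only observation is raw audio), a toy audio-only navigation loop might look like the following. The class name, observation length, action set, and reward shaping are all hypothetical stand-ins, not the authors' Unity setup:

```python
import numpy as np

class AudioNavEnv:
    """Toy stand-in for an audio-based navigation environment: the agent
    observes only a raw waveform, as a listener at its position would
    hear it. All names and shapes here are illustrative assumptions."""

    N_ACTIONS = 4          # move +x, -x, +y, -y
    AUDIO_SAMPLES = 4000   # e.g. 0.25 s of 16 kHz mono audio per step

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.agent_pos = np.zeros(2)
        self.source_pos = np.array([5.0, 3.0])

    def _observe(self):
        # Simplistic acoustic model: noise-like "speech" attenuated
        # by distance to the speaker.
        dist = np.linalg.norm(self.source_pos - self.agent_pos)
        gain = 1.0 / (1.0 + dist)
        return gain * self.rng.standard_normal(self.AUDIO_SAMPLES)

    def reset(self):
        self.agent_pos = np.zeros(2)
        return self._observe()

    def step(self, action):
        moves = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
        self.agent_pos += moves[action]
        dist = np.linalg.norm(self.source_pos - self.agent_pos)
        reward = -dist          # closer to the speaker = higher reward
        done = dist < 0.5
        return self._observe(), reward, done

env = AudioNavEnv()
obs = env.reset()
obs, reward, done = env.step(0)  # raw waveform in, scalar reward out
```

An RL algorithm would then train a policy mapping the waveform (or a spectrogram of it) to one of the discrete actions.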

A Deep Reinforcement Learning Approach To Audio-Based Navigation In A Multi-Speaker Environment
In this work we use deep reinforcement learning to create an autonomous agent that can navigate in a two-dimensional space using only raw auditory sensory information from the environment, a problem…
Agents that Listen: High-Throughput Reinforcement Learning with Multiple Sensory Systems
A new version of the ViZDoom simulator is introduced to create a highly efficient learning environment that provides raw audio observations, and the performance of different model architectures is studied in a series of tasks that require the agent to recognize sounds and execute instructions given in natural language.
Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates
This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network designed to directly estimate the three-dimensional position of a single acoustic source using the raw audio signal as the input information and avoiding the use of hand-crafted audio features.
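As a rough, dependency-free sketch of the idea in this summary, a convolutional network regressing source coordinates directly from the raw waveform can be reduced to a forward pass with random, untrained weights; the layer sizes and kernel counts below are invented for illustration, not taken from the paper:

```python
import numpy as np

def conv1d_relu(x, kernels):
    """Valid-mode 1-D convolution of a mono signal with a bank of
    kernels, followed by ReLU. `kernels` has shape (n_filters, width)."""
    width = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, width)
    return np.maximum(windows @ kernels.T, 0.0)   # (time, n_filters)

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)          # raw audio frame (input)
kernels = rng.standard_normal((8, 64)) * 0.1
w_head = rng.standard_normal((8, 3)) * 0.1  # features -> (x, y, z)

features = conv1d_relu(signal, kernels)     # learned filter responses
pooled = features.mean(axis=0)              # global average pooling
xyz = pooled @ w_head                       # predicted 3-D coordinates
```

Training would fit the kernels and head by minimizing a regression loss (e.g. mean squared error) between `xyz` and the true source position.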
Deep Reinforcement Learning for Audio-Visual Gaze Control
This work addresses a novel audio-visual fusion framework well suited for controlling the gaze of a robotic head, together with a reinforcement learning (RL) formulation of the gaze-control problem that uses a reward function based on the available temporal sequence of camera and microphone observations.
Sound localization and multi-modal steering for autonomous virtual agents
A framework is developed that enables autonomous virtual agents to localize sounds in dynamic virtual environments, subject to distortion effects due to attenuation, reflection and diffraction from obstacles, as well as interference between multiple audio signals.
Do Autonomous Agents Benefit from Hearing?
Results show that the agent improves its behavior when visual information is accompanied by audio features; the multi-modal setup is assessed in reach-the-goal tasks in the ViZDoom environment.
ViZDoom: A Doom-based AI research platform for visual reinforcement learning
A novel test-bed platform for reinforcement learning research from raw visual information is presented, which employs the first-person perspective in a semi-realistic 3D world; the results confirm the utility of ViZDoom as an AI research platform and imply that visual reinforcement learning in realistic 3D first-person perspective environments is feasible.
Sound Source Localization Using Deep Learning Models
This study shows that with end-to-end training and generic preprocessing, the performance of deep residual networks not only surpasses the block-level accuracy of linear models in nearly clean environments but also shows robustness to challenging conditions by exploiting the time delay on power information.
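The summary above mentions exploiting time-delay information between channels. A standard, textbook way to estimate the delay between two microphone signals is GCC-PHAT, sketched below; this is a generic method for context, not necessarily the exact preprocessing used in that study:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay (in seconds) of `sig` relative to `ref`
    using the generalized cross-correlation with phase transform."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Re-center so index 0 of `cc` corresponds to shift -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

fs = 16000
rng = np.random.default_rng(1)
ref = rng.standard_normal(4096)
delay = 25                                # ground-truth delay in samples
sig = np.concatenate((np.zeros(delay), ref))[:4096]
tdoa = gcc_phat(sig, ref, fs)             # recovers ~25 / 16000 s
```

With a known microphone spacing, such a time difference of arrival maps directly to a direction-of-arrival estimate.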
Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey
The fundamental background behind sim-to-real transfer in deep reinforcement learning is covered, and the main methods currently in use are overviewed: domain randomization, domain adaptation, imitation learning, meta-learning and knowledge distillation.
Deep Neural Networks for Multiple Speaker Detection and Localization
  • W. He, P. Motlíček, J. Odobez
  • Computer Science, Engineering
    2018 IEEE International Conference on Robotics and Automation (ICRA)
  • 2018
This paper proposes a likelihood-based encoding of the network output, which naturally allows the detection of an arbitrary number of sources, and investigates the use of sub-band cross-correlation information as features for better localization in sound mixtures.
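The likelihood-based output encoding mentioned above can be illustrated with a minimal one-dimensional sketch: each source direction contributes a Gaussian bump on an azimuth grid, and sources are decoded as local maxima above a threshold, so the number of detections is not fixed in advance. The grid resolution and bump width here are illustrative assumptions, not the paper's values:

```python
import numpy as np

def encode_sources(azimuths_deg, sigma_deg=8.0, n_bins=360):
    """Encode any number of source directions as a Gaussian likelihood
    over azimuth bins (circular distance handles wrap-around at 360°)."""
    grid = np.arange(n_bins)
    out = np.zeros(n_bins)
    for az in azimuths_deg:
        d = np.minimum(np.abs(grid - az), n_bins - np.abs(grid - az))
        out = np.maximum(out, np.exp(-0.5 * (d / sigma_deg) ** 2))
    return out

def decode_sources(likelihood, threshold=0.5):
    """Recover source directions as local maxima above a threshold,
    so an arbitrary number of sources can be detected."""
    peaks = []
    n = len(likelihood)
    for i in range(n):
        left = likelihood[(i - 1) % n]
        right = likelihood[(i + 1) % n]
        if likelihood[i] >= threshold and likelihood[i] > left and likelihood[i] >= right:
            peaks.append(i)
    return peaks

target = encode_sources([40, 200])   # two simultaneous speakers
detected = decode_sources(target)    # -> [40, 200]
```

A network trained to regress `target` from audio features can then report zero, one, or several speakers simply by thresholding its output.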