A Deep Reinforcement Learning Approach To Audio-Based Navigation In A Multi-Speaker Environment

Petros Giannakopoulos, Aggelos Pikrakis, Yannis Cotronis
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this work we use deep reinforcement learning to create an autonomous agent that can navigate in a two-dimensional space using only raw auditory sensory information from the environment, a problem that has received very little attention in the reinforcement learning literature. Our experiments show that the agent can successfully identify a particular target speaker among a set of N predefined speakers in a room and move itself towards that speaker, while avoiding collision with other… 

A Deep Reinforcement Learning Approach for Audio-based Navigation and Audio Source Localization in Multi-speaker Environments
This work creates two virtual environments using the Unity game engine, builds an autonomous agent based on the PPO online reinforcement learning algorithm, attempts to train it to solve these environments, and shows that a degree of agent knowledge transfer is possible between the environments.


Do Autonomous Agents Benefit from Hearing?
A multi-modal setup is assessed in reach-the-goal tasks in the ViZDoom environment, and results show that the agent improves its behavior when visual information is accompanied by audio features.
Asynchronous Methods for Deep Reinforcement Learning
A conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers and shows that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates
This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network designed to directly estimate the three-dimensional position of a single acoustic source using the raw audio signal as the input information and avoiding the use of hand-crafted audio features.
Deep Neural Networks for Multiple Speaker Detection and Localization
W. He, P. Motlícek, J. Odobez · Computer Science, Engineering · 2018 IEEE International Conference on Robotics and Automation (ICRA) · 2018
This paper proposes a likelihood-based encoding of the network output, which naturally allows the detection of an arbitrary number of sources, and investigates the use of sub-band cross-correlation information as features for better localization in sound mixtures.
ViZDoom: A Doom-based AI research platform for visual reinforcement learning
A novel test-bed platform for reinforcement learning research from raw visual information which employs the first-person perspective in a semi-realistic 3D world and confirms the utility of ViZDoom as an AI research platform and implies that visual reinforcement learning in 3D realistic first-person perspective environments is feasible.
Sound localization and multi-modal steering for autonomous virtual agents
A framework that enables autonomous virtual agents to localize sounds in dynamic virtual environments, subject to distortion effects due to attenuation, reflection and diffraction from obstacles, as well as interference between multiple audio signals is developed.
Deep learning methods in speaker recognition: a review
This paper reviews applied Deep Learning practices in the field of Speaker Recognition, covering both verification and identification, and finds that Deep Learning appears to have become the state-of-the-art solution for both Speaker Verification (SV) and identification.
Robust Speaker Localization Guided by Deep Learning-Based Time-Frequency Masking
Deep learning-based time-frequency (T-F) masking has dramatically advanced monaural (single-channel) speech separation and enhancement. This study investigates its potential for direction of arrival
Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise
A novel method to train the CNN using synthesized noise signals is proposed, evaluated for two speakers, and compared to a well-known steered response power method.
Unity: A General Platform for Intelligent Agents
This work proposes a novel taxonomy of existing simulation platforms and discusses the highest level class of general platforms which enable the development of learning environments that are rich in visual, physical, task, and social complexity.