Comparing time and frequency domain for audio event recognition using deep learning
Human activity recognition is a key component for socially enabled robots to effectively and naturally interact with humans. In this paper we exploit the fact that many human activities produce characteristic sounds from which a robot can infer the corresponding actions. We propose a novel recognition approach called Non-Markovian Ensemble Voting (NEV) able to classify multiple human activities in an on line fashion without the need for silence detection or audio stream segmentation. Moreover, the method can deal with activities that are extended over undefined periods in time. In a series of experiments in real reverberant environments, we are able to robustly recognize 22 different sounds that correspond to a number of human activities in a bathroom and kitchen context. Our method outperforms several established classification techniques.