Robust methods for content analysis of auditory scenes
Recent progress in audio analysis methods opens up possibilities for new applications. At the same time, these improvements push established approaches ever closer to their performance limits, which are defined by disturbing factors such as overlapping speech, noise, and reverberation. This thesis contributes both to enabling new applications and to addressing such disturbing factors. First, it proposes a system for the classification of acoustic scenes and a method for acoustic gait-based person identification, two relatively new audio recognition tasks. Furthermore, improvements to two established methods, speaker diarization and robust speech recognition, are presented. To improve speaker diarization, different approaches for detecting overlapping speech are proposed. To increase the robustness of a speech recognition system against noise and reverberation, memory-enhanced acoustic modelling is employed. Together, the proposed modules form a complete system for auditory scene analysis: starting from a coarse classification of the scene as a whole, persons can be identified by their step sounds or voices, followed by a transcription of the spoken content. Experimental evaluations on publicly available databases and within public research challenges demonstrate the effectiveness of the proposed methods.