Intelligent Single-Channel Methods for Multi-Source Audio Analysis
This thesis investigates the potential of recent machine learning methods for the challenging task of single-channel, multi-source audio analysis, i.e., information extraction from single-channel audio where the sources of interest (e.g., speech) are mixed with multiple interfering sources. First, it is shown that source separation by recently proposed techniques for non-negative matrix factorization can significantly improve recognition performance compared to the state-of-the-art approach of training the recognition system on multi-source data. Second, it is shown that by formulating the source separation problem itself as a recognition task, state-of-the-art methods for supervised training of recognition systems, such as deep neural network models, can be used to achieve previously unseen performance in single-channel source separation. In this context, supervised training of non-negative models is introduced as well. The task of multi-source recognition as defined above is exemplified by challenging real-world speech separation and recognition problems where speech is mixed with non-stationary background noise such as music, and world-leading results in international evaluation campaigns are demonstrated for this task. Furthermore, state-of-the-art results are presented in selected music information retrieval applications involving polyphonic audio, such as characterizing the singer or transcribing the music into a score.