Recognizing people by the way they walk – also known as gait recognition – has been studied extensively in recent years. Recent gait recognition methods focus solely on data extracted from an RGB video stream. With this work, we provide a means for multimodal gait recognition by introducing the freely available TUM Gait from Audio, Image and Depth …
We present the Munich contribution to the PASCAL 'CHiME' Speech Separation and Recognition Challenge: Our approach combines source separation by supervised convolutive non-negative matrix factorisation (NMF) with our tandem recogniser that augments acoustic features by word predictions of a Long Short-Term Memory recurrent neural network in a multi-stream …
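As a rough illustration of the supervised separation step, the sketch below assumes magnitude spectrograms and bases pre-trained offline on clean speech and on noise; the dictionary is held fixed, only the activations are updated, and a Wiener-style mask reconstructs the speech. It uses plain rather than convolutive NMF for brevity, and all names are illustrative.

```python
import numpy as np

def nmf_separate(V, W_speech, W_noise, n_iter=100, eps=1e-10):
    """V: (freq, time) magnitude spectrogram of the noisy mixture."""
    W = np.hstack([W_speech, W_noise])            # fixed, pre-trained dictionary
    H = np.random.rand(W.shape[1], V.shape[1])    # activations, estimated at test time
    for _ in range(n_iter):
        WH = W @ H + eps
        # multiplicative update for the KL divergence, bases held fixed
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    k = W_speech.shape[1]
    V_s = W_speech @ H[:k]                        # speech component estimate
    V_n = W_noise @ H[k:]                         # noise component estimate
    return V * V_s / (V_s + V_n + eps)            # Wiener-style masked speech
```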
This paper describes our joint submission to the REVERB Challenge, which calls for automatic speech recognition systems that are robust to varying room acoustics. Our approach uses deep recurrent neural network (DRNN) based feature enhancement in the log spectral domain as a single-channel front-end. The system is generalized to multi-channel audio by …
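A minimal sketch of the single-channel front-end idea follows: a deep recurrent network regresses from noisy to clean log-magnitude spectra. Layer sizes and the 257-bin resolution are illustrative assumptions, not the actual challenge configuration.

```python
import torch
import torch.nn as nn

class DRNNEnhancer(nn.Module):
    def __init__(self, n_bins=257, hidden=256, layers=2):
        super().__init__()
        self.rnn = nn.LSTM(n_bins, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, noisy_logspec):              # (batch, time, n_bins)
        h, _ = self.rnn(noisy_logspec)
        return self.out(h)                          # enhanced log spectrum

# Training would minimise e.g. the MSE against the clean log spectrum;
# at test time the enhanced magnitude is recombined with the noisy phase.
enhanced = DRNNEnhancer()(torch.randn(1, 100, 257))
```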
We present our joint contribution to the 2nd CHiME Speech Separation and Recognition Challenge. Our system combines speech enhancement by supervised sparse non-negative matrix factorisation (NMF) with a multi-stream speech recognition system. In addition to a conventional MFCC HMM recogniser, predictions by a bidirectional Long Short-Term Memory recurrent neural network …
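The sparse variant can be sketched in the same supervised setting as above, with an L1 penalty (lam) on the activations so that only a few basis vectors are active per frame; the penalty weight and names are illustrative assumptions.

```python
import numpy as np

def sparse_nmf_activations(V, W, lam=0.1, n_iter=100, eps=1e-10):
    """V: (freq, time) mixture spectrogram; W: fixed pre-trained bases."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        WH = W @ H + eps
        # KL multiplicative update; the sparsity weight enters the denominator
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + lam + eps)
    return H
```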
This work describes a system for acoustic scene classification using large-scale audio feature extraction. It is our contribution to the Scene Classification track of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (D-CASE). The system classifies 30-second recordings of 10 different acoustic scenes. From the highly …
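The feature-functionals idea can be sketched as follows: frame-level descriptors are summarised by statistics over the whole 30 s clip and classified by a static model. The actual system brute-forces a far larger feature set; librosa MFCCs with mean/std functionals are a small stand-in here, and the SVM back-end is an assumption.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(path):
    y, sr = librosa.load(path)                          # one 30 s recording
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, frames)
    # functionals: per-coefficient mean and std over the entire clip
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# X = np.stack([clip_features(p) for p in train_paths])
# clf = SVC(kernel="linear").fit(X, scene_labels)       # 10 scene classes
```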
We present a highly efficient, data-based method for monaural feature enhancement targeted at automatic speech recognition (ASR) in reverberant environments with highly non-stationary noise. Our approach is based on bidirectional Long Short-Term Memory recurrent neural networks trained to map noise-corrupted features to clean features. In extensive test …
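A minimal sketch of the enhancement network: a bidirectional LSTM trained to map noise-corrupted feature trajectories to their clean counterparts. Feature dimensionality, layer sizes and the toy batch are placeholder assumptions.

```python
import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    def __init__(self, n_feat=26, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(n_feat, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_feat)       # 2x for both directions

    def forward(self, x):                              # (batch, time, n_feat)
        h, _ = self.blstm(x)
        return self.out(h)

net = BLSTMEnhancer()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
noisy, clean = torch.randn(8, 200, 26), torch.randn(8, 200, 26)
opt.zero_grad()
loss = nn.functional.mse_loss(net(noisy), clean)       # noisy -> clean mapping
loss.backward()
opt.step()
```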
This article proposes and evaluates various methods to integrate the concept of bidirectional Long Short-Term Memory (BLSTM) temporal context modeling into a system for automatic speech recognition (ASR) in noisy and reverberated environments. Building on recent advances in Long Short-Term Memory architectures for ASR, we design a novel front-end for …
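One way such a front-end can be sketched is in the tandem style: frame-wise BLSTM predictions are appended to the conventional acoustic features before they reach the back-end recogniser. The BLSTM below is assumed to be pre-trained for framewise classification and to return (1, time, n_classes) logits; all names are illustrative.

```python
import torch

def augmented_features(feats, blstm):
    """feats: (time, n_feat) tensor of conventional acoustic features."""
    with torch.no_grad():
        logits = blstm(feats.unsqueeze(0)).squeeze(0)   # (time, n_classes)
        post = torch.log_softmax(logits, dim=-1)        # log posteriors
    return torch.cat([feats, post], dim=-1)             # BLSTM-augmented features
```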
In light of the improvements made in recent years with neural network-based acoustic models, it is an interesting question whether these models are also suited for noise-robust recognition. This has not yet been fully explored, although initial experiments suggest that they are. Furthermore, preprocessing techniques that improve the robustness …
Acoustic event detection in surveillance scenarios is an important but difficult problem. Realistic systems struggle with noisy recording conditions. In this work, we propose to use Gabor filterbank features to detect target events in different noisy background scenes. These features capture spectro-temporal modulation frequencies in the signal, which …
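As a rough sketch of such spectro-temporal features, 2-D Gabor kernels tuned to different temporal and spectral modulation frequencies can be convolved with a log-mel spectrogram. Kernel sizes and modulation rates below are illustrative assumptions, not the challenge filterbank.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(omega_t, omega_f, size=15, sigma=3.0):
    r = np.arange(size) - size // 2
    T, F = np.meshgrid(r, r)                       # time / frequency offsets
    envelope = np.exp(-(T**2 + F**2) / (2 * sigma**2))
    return envelope * np.cos(omega_t * T + omega_f * F)

def gabor_features(log_mel):
    """log_mel: (mel_bands, frames) -> (n_filters, mel_bands, frames)."""
    kernels = [gabor_kernel(wt, wf) for wt in (0.25, 0.5) for wf in (0.25, 0.5)]
    return np.stack([convolve2d(log_mel, k, mode="same") for k in kernels])
```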
This paper presents the TUM contribution to the 2014 REVERB Challenge: we describe a system for robust recognition of reverberated speech. In addition to an HMM-GMM recogniser, we use bidirectional long short-term memory (LSTM) recurrent neural networks. These networks can exploit long-range temporal context by using memory cells in the hidden units, which …
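The memory-cell mechanism can be illustrated with a single LSTM time step: gated accumulation in the cell state is what lets the network retain long-range temporal context. This is the standard cell, not necessarily the exact variant of the submission; weight shapes are generic.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """W: (4h, n_in), U: (4h, h), b: (4h,), stacked for gates i, f, o, g."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)     # memory cell: gated accumulation over time
    h = o * np.tanh(c)                  # output exposed to the next layer
    return h, c
```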