Making Machines Understand Us in Reverberant Rooms: Robustness Against Reverberation for Automatic Speech Recognition
Automatic speech recognition (ASR) using far-field microphones is a challenging task in which room reverberation is one of the primary causes of performance degradation. This study proposes a multichannel spectral enhancement method for reverberation-robust ASR using distributed microphones. The proposed method uses nonnegative tensor factorization to identify the clean speech component from a set of observed reverberant spectrograms across the different channels. The general family of alpha-beta divergences is used for the tensor decomposition task, which provides increased flexibility for the algorithm and is shown to yield improvements in highly reverberant scenarios. Unlike many conventional array processing solutions, the proposed method does not require closely spaced microphones and is independent of source and microphone locations. The algorithm can automatically adapt to unbalanced direct-to-reverberation ratios among channels, which is useful in blind scenarios in which no information is available about source-to-microphone distances. For a medium-vocabulary distant ASR task based on TIMIT utterances, and using clean-trained deep neural network acoustic models, absolute word error rate (WER) improvements of 17.2%, 20.7%, and 23.2% are achieved in the single-channel, two-channel, and four-channel scenarios, respectively.
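To make the cost function concrete, the following is a minimal sketch of the generic alpha-beta divergence that the abstract names as the tensor decomposition criterion. The function name and the sanity check are illustrative assumptions, not the paper's implementation; the formula follows the standard definition of the alpha-beta divergence family for nonnegative data, valid when alpha, beta, and alpha + beta are all nonzero.

```python
# Illustrative sketch (not the paper's code): the alpha-beta divergence
# between two nonnegative arrays, e.g. an observed reverberant spectrogram
# and its nonnegative tensor factorization approximation.
import numpy as np

def alpha_beta_divergence(P, Q, alpha, beta):
    """Generic alpha-beta divergence D_AB(P || Q).

    Assumes alpha != 0, beta != 0, and alpha + beta != 0; the remaining
    members of the family are defined as limits of this expression.
    """
    s = alpha + beta
    return (-1.0 / (alpha * beta)) * np.sum(
        P**alpha * Q**beta
        - (alpha / s) * P**s
        - (beta / s) * Q**s
    )

# Sanity check: at alpha = beta = 1 the divergence reduces to half the
# squared Euclidean distance, one of its classical special cases.
rng = np.random.default_rng(0)
P = rng.random((4, 5))
Q = rng.random((4, 5))
d = alpha_beta_divergence(P, Q, 1.0, 1.0)
```

Tuning (alpha, beta) moves the cost continuously between familiar special cases (half squared Euclidean, Kullback-Leibler, Itakura-Saito), which is the flexibility the abstract refers to for highly reverberant conditions.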