Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors

@article{Kumatani2012MicrophoneAP,
  title={Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors},
  author={Ken'ichi Kumatani and John W. McDonough and Bhiksha Raj},
  journal={IEEE Signal Processing Magazine},
  year={2012},
  volume={29},
  pages={127--140}
}
Distant speech recognition (DSR) holds the promise of the most natural human-computer interface because it enables man-machine interaction through speech, without the need to don intrusive body- or head-mounted microphones. Recognizing distant speech robustly, however, remains a challenge. This contribution provides a tutorial overview of DSR systems based on microphone arrays. In particular, we present recent work on acoustic beamforming for DSR, along with experimental results…
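For context on the acoustic beamforming the abstract refers to: the simplest array front-end is a delay-and-sum beamformer, which time-aligns each microphone channel toward the look direction and averages. A minimal sketch (function name and delay convention are ours, not the paper's; delays are fractional-sample shifts applied in the frequency domain):

```python
import numpy as np

def delay_and_sum(frames, delays, fs):
    """Delay-and-sum beamformer.

    frames: (num_channels, num_samples) array of microphone signals
    delays: per-channel time shifts in seconds (positive delays a channel)
    fs:     sampling rate in Hz
    """
    num_samples = frames.shape[1]
    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    # A time shift d corresponds to a linear phase exp(-2j*pi*f*d);
    # this implements fractional (circular) delays per channel.
    phase = np.exp(-2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    aligned = np.fft.irfft(spectra * phase, n=num_samples, axis=1)
    return aligned.mean(axis=0)
```

Steering delays would in practice come from a source localizer and the array geometry; here they are simply given.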
Microphone array processing for distant speech recognition: Towards real-world deployment
  • K. Kumatani, T. Arakawa, I. Tashev
  • Computer Science
    Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference
  • 2012
TLDR
This paper presents recent work on acoustic beamforming for DSR, along with experimental results verifying the effectiveness of the various algorithms described here, and reports the results of speech recognition experiments on data captured with a popular device, the Kinect.
Microphone array processing for distant speech recognition: Spherical arrays
  • J. McDonough, K. Kumatani, B. Raj
  • Computer Science
    Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference
  • 2012
TLDR
This work compares a 64-channel linear array with a total length of 126 cm to a 32-channel spherical array with a radius of 4.2 cm, and found that these provided word error rates of 9.3% and 10.2%, respectively, on a DSR task.
DISTANT SPEECH RECOGNITION USING MICROPHONE ARRAYS
TLDR
An improved steering vector is proposed to increase the performance of the Minimum Variance Distortionless Response (MVDR) beamformer on real data and to reduce its Word Error Rate (WER).
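The MVDR beamformer named in this summary minimizes output power subject to a distortionless constraint in the look direction: given a noise covariance matrix R and steering vector d, the weights are w = R⁻¹d / (dᴴR⁻¹d). A minimal sketch of that closed form (ours, not the paper's implementation):

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR weights: minimize w^H R w subject to w^H d = 1."""
    Rinv_d = np.linalg.solve(R, d)          # R^{-1} d without explicit inverse
    return Rinv_d / np.vdot(d, Rinv_d)      # normalize so w^H d = 1
```

The improved steering vector the paper proposes would enter here as a better estimate of d; with spatially white noise (R = I) the weights reduce to a scaled delay-and-sum.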
Channel selection and reverberation-robust automatic speech recognition
TLDR
This thesis focuses on ASR applications in a room environment, where reverberation is the dominant source of distortion, and considers both single- and multi-microphone setups; it provides an overview of the channel selection (CS) measures presented in the literature so far and compares them experimentally.
Microphone Array Processing Strategies for Distant-Based Automatic Speech Recognition
TLDR
The author's experience shows that applying SA to individual channels and merging the results with ROVER reduces the negative effects of SA reported by others in the field, and the paper illustrates the overall improvement obtained with front-end enhancement techniques in DSR.
Enabling speech applications using Ad-Hoc Microphone Arrays
Microphone arrays are central players in hands-free speech interface applications. The main duty of a microphone array is capturing distant-talking speech with high quality. A microphone array can…
Information Fusion Approaches for Distant Speech Recognition in a Multi-microphone Setting
TLDR
Two original solutions are presented, based on information fusion approaches at different levels of the recognition system, one at front-end stage and one at post-decoding stage, namely for the problems of channel selection (CS) and hypothesis combination.
Hybrid acoustic models for distant and multichannel large vocabulary speech recognition
TLDR
The accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones.
Recurrent Models for Auditory Attention in Multi-Microphone Distant Speech Recognition
TLDR
This work presents a neural attention network that directly combines multi-channel audio to generate phonetic states without requiring any prior knowledge of the microphone layout or any explicit signal preprocessing for speech enhancement.

References

Showing 1-10 of 52 references
Microphone Array Driven Speech Recognition: Influence of Localization on the Word Error Rate
TLDR
This work compares the accuracy of source localization systems based on audio features alone, video features alone, and a combination of the two, using speech data collected during seminars held by actual speakers; the results reveal that accurate speaker localization is crucial for minimizing the error rate of a far-field ASR system.
The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): specification and initial experiments
TLDR
This paper describes the collection of an audio-visual corpus of read speech, recorded in a number of instrumented meeting rooms and suitable for use in continuous speech recognition experiments; it is captured using a variety of microphones, including arrays, as well as close-up and wider-angle cameras.
Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering
TLDR
A theoretical analysis of noise reduction and dereverberation algorithms based on a microphone array combined with a Wiener postfilter shows that an appreciable reduction of acoustic echo and localized noise is obtained, making the whole system highly attractive for hands-free communication.
To separate speech: a system for recognizing simultaneous speech
TLDR
This work describes the system for the recognition of such simultaneous speech and investigates the effect of the filter bank design used to perform subband analysis and synthesis during beamforming on the SSC development data.
Microphone Array Beamforming Approach to Blind Speech Separation
TLDR
This paper presents a microphone array beamforming approach to blind speech separation that does not require a priori knowledge of the microphone placement and speaker location, making the system directly comparable to other blind source separation methods which require no prior knowledge of recording conditions.
The CHiME corpus: a resource and a challenge for computational hearing in multisource environments
TLDR
This paper presents CHiME, a new corpus designed for noise-robust speech processing research; it includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment.
Bridging the Gap: Towards a Unified Framework for Hands-Free Speech Recognition Using Microphone Arrays
  • M. Seltzer
  • Computer Science
    2008 Hands-Free Speech Communication and Microphone Arrays
  • 2008
TLDR
This paper describes two families of algorithms for hands-free speech recognition using microphone arrays which consider all processing stages to be components of a single system that operates with the common goal of improved recognition accuracy.
Tracking and beamforming for multiple simultaneous speakers with probabilistic data association filters
TLDR
This work generalizes the IEKF, first to a probabilistic data association filter (PDAF), which incorporates a clutter model for the rejection of spurious acoustic events, and then to a joint probabilistic data association filter (JPDAF), which maintains a separate state vector for each active speaker.