Acoustic Event Mixing to Multichannel AMI Data for Distant Speech Recognition and Acoustic Event Classification Benchmarking

Sergei Astapov, Gleb I. Svirskiy, Aleksandr Lavrentyev, Tatyana Prisyach, Dmitriy Popov, Dmitriy Ubskiy and Vladimir Kabarov
Currently, the quality of Distant Speech Recognition (DSR) systems cannot match the quality of speech recognition on clean speech acquired by close-talking microphones. The main problems of DSR stem from the far-field nature of the data; one of them is the unpredictable occurrence of acoustic events and scenes, which distort the signal's speech component. Applying acoustic event detection and classification (AEC) in conjunction with DSR can benefit speech enhancement and improve DSR…
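The core idea of mixing acoustic events into multichannel recordings can be sketched as follows. This is a hypothetical illustration, not the paper's actual pipeline: an event clip is scaled to a target event-to-speech SNR and added at a random offset across all channels.

```python
import numpy as np

def mix_event(speech, event, snr_db, rng=None):
    """Mix a shorter acoustic event into multichannel speech at a target
    event-to-speech SNR. Both arrays have shape (channels, samples).

    Hypothetical helper illustrating the general idea of acoustic event
    mixing; a real setup would also convolve the event with room impulse
    responses before summation.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_ch, n_sp = speech.shape
    _, n_ev = event.shape
    # Place the event at a random offset within the speech recording.
    start = int(rng.integers(0, n_sp - n_ev + 1))
    # Scale the event so that p_speech / p_event_scaled matches snr_db.
    p_speech = np.mean(speech ** 2)
    p_event = np.mean(event ** 2)
    gain = np.sqrt(p_speech / (p_event * 10 ** (snr_db / 10)))
    mixed = speech.copy()
    mixed[:, start:start + n_ev] += gain * event
    return mixed, start
```

At 0 dB the event is scaled to the same average power as the speech; positive SNR values attenuate it.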
1 Citation
Directional Clustering with Polyharmonic Phase Estimation for Enhanced Speaker Localization
To reduce the shortcomings of signal acquisition with large-aperture arrays and lessen the impact of noise and interference, a time-frequency masking approach is proposed that applies Complex Angular Central Gaussian Mixture Models for sound source directional clustering and inter-component phase analysis for polyharmonic speech component restoration.


References

The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
The 5th CHiME Challenge is introduced, which considers the task of distant multi-microphone conversational ASR in real home environments and describes the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.
The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions
A database designed to evaluate the performance of speech recognition algorithms in noisy conditions, with recognition results presented for the first standard Distributed Speech Recognition (DSR) feature extraction scheme, which is based on cepstral analysis.
Acoustic Beamforming for Speaker Diarization of Meetings
The use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain and shows improvements in a speech recognition task.
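The classic beamforming mentioned above can be illustrated with a minimal delay-and-sum sketch: each channel is phase-shifted in the frequency domain to compensate its relative delay, then the channels are averaged. This is a generic textbook version, not the cited frontend; real systems such as BeamformIt also estimate the delays (e.g. via GCC-PHAT) and weight channels by quality.

```python
import numpy as np

def delay_and_sum(frames, delays):
    """Frequency-domain delay-and-sum beamforming (circular shifts).

    frames: (channels, samples) array of synchronous frames.
    delays: per-channel delays in samples to compensate (advance).
    A minimal sketch of the classic technique under the assumption that
    the delays are already known.
    """
    frames = np.asarray(frames, dtype=float)
    delays = np.asarray(delays, dtype=float)
    n = frames.shape[1]
    spec = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n)  # cycles per sample
    # Advance each channel by its delay, then average across channels.
    phase = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((spec * phase).mean(axis=0), n=n)
```

With correct delays the target components add coherently while uncorrelated noise is averaged down.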
Ideal Ratio Mask Estimation Using Deep Neural Networks for Monaural Speech Segregation in Noisy Reverberant Conditions
The IRM is extended to reverberant conditions, where the direct sound and early reflections of the target speech are regarded as the desired signal; the approach provides substantial improvements in speech intelligibility and speech quality over the unprocessed mixture signals under various noisy and reverberant conditions.
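The Ideal Ratio Mask itself has a compact standard definition, shown below as a sketch. In the reverberant extension described above, the "speech" term would cover the direct path plus early reflections rather than the anechoic source.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-12):
    """Ideal Ratio Mask over magnitude spectrograms (freq x frames).

    Standard energy-ratio form: IRM = sqrt(S^2 / (S^2 + N^2)).
    `eps` guards against division by zero in silent bins.
    """
    s2 = np.asarray(speech_mag, dtype=float) ** 2
    n2 = np.asarray(noise_mag, dtype=float) ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))
```

A DNN is trained to predict this mask from noisy features; applying the predicted mask to the noisy spectrogram yields the enhanced signal.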
Enhanced voice activity detection using acoustic event detection and classification
A novel voice activity detection technique is proposed that consists of two major modules, 1) classification and 2) detection, and enables efficient operation of speech recognition in a continuously listening environment without any touch or key input.
Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home
The structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition and performance is evaluated using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in earlier work.
The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms
An evaluation of five baseline VAD systems on the QUT-NOISE-TIMIT corpus is conducted to validate the data, showing that the variety of noise available allows better evaluation of VAD systems than existing approaches in the literature.
Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System Using Deep Recurrent Neural Networks
A recurrent neural network is used to enhance acoustic parameters prior to training; the voice built with enhanced parameters was ranked significantly higher than those trained with noisy speech or with speech enhanced by a conventional enhancement system.
Simultaneous Speech Recognition and Acoustic Event Detection Using an LSTM-CTC Acoustic Model and a WFST Decoder
Experimental results show that precision and recall rates of filler detection can be controlled by the filler confidence score, and word fragments can be detected without registering all possible word fragments to the lexicon.
English Conversational Telephone Speech Recognition by Humans and Machines
An independent set of human performance measurements on two conversational tasks are performed and it is found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve.