SRIB-LEAP submission to Far-field Multi-Channel Speech Enhancement Challenge for Video Conferencing

Raghu G. Raj, Rohit Kumar, M. K. Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M. Ali Basha Shaik
This paper presents the details of the SRIB-LEAP submission to the ConferencingSpeech 2021 challenge. The challenge involved the task of multi-channel speech enhancement to improve the quality of far-field speech captured by microphone arrays in a video conferencing room. We propose a two-stage method involving a beamformer followed by single-channel enhancement. For the beamformer, we incorporated a self-attention mechanism as the inter-channel processing layer in the filter-and-sum network (FaSNet), an…
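The inter-channel self-attention idea can be illustrated with a minimal sketch (a hedged illustration, not the authors' implementation; the function name, weight matrices, and shapes here are hypothetical): each microphone's per-channel feature attends to all channels via scaled dot-product attention.

```python
import numpy as np

def channel_self_attention(x, wq, wk, wv):
    """Scaled dot-product attention across the microphone-channel axis.

    x: (n_mics, feat_dim) per-channel features for one frame.
    wq, wk, wv: (feat_dim, d) projection matrices (hypothetical shapes).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])       # (n_mics, n_mics) channel affinities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability before exp
    att = np.exp(scores)
    att /= att.sum(axis=1, keepdims=True)        # softmax over channels
    return att @ v                               # each mic mixes information from all mics
```

The output keeps one feature vector per microphone, so such a layer can replace a simpler inter-channel pooling step inside a beamforming network.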

3-D CNN Models for Far-Field Multi-Channel Speech Recognition

A three-dimensional (3-D) convolutional neural network (CNN) architecture for multi-channel far-field ASR, which processes the time, frequency, and channel dimensions of the input spectrogram to learn representations using convolutional layers.

Deep Learning Based Dereverberation of Temporal Envelopes for Robust Speech Recognition

The proposed neural enhancement model performs an envelope-gain-based enhancement of temporal envelopes and consists of a series of convolutional and recurrent neural network layers used to generate features for automatic speech recognition (ASR).

Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks

Several integration architectures are proposed and tested, including a pipeline architecture of LSTM-based SE and ASR with sequence training, an alternating estimation architecture, and a multi-task hybrid LSTM network architecture.

FaSNet: Low-Latency Adaptive Beamforming for Multi-Microphone Audio Processing

Experiments show that despite its small model size, FaSNet is able to outperform several traditional oracle beamformers with respect to scale-invariant signal-to-noise ratio (SI-SNR) in reverberant speech enhancement and separation tasks.
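The SI-SNR metric mentioned here has a simple closed form: project the estimate onto the reference to obtain a scale-invariant target, then measure the energy ratio against the residual. A minimal NumPy sketch (the function name and epsilon regularizer are my own, not from the paper):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    # Zero-mean both signals so the metric is invariant to DC offset.
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: the scale-invariant target.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    # Energy ratio between the target projection and the residual.
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because any rescaling of the estimate leaves the projection direction unchanged, a scaled copy of the reference scores (near-)perfectly, which is what makes the metric scale-invariant.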

Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise

It is shown that BLSTM networks are well suited to mapping noisy speech features to clean ones, and that the resulting recognition performance gain is partly complementary to improvements from additional techniques such as speech enhancement by non-negative matrix factorization and probabilistic feature generation by bottleneck BLSTM networks.

A Regression Approach to Speech Enhancement Based on Deep Neural Networks

The proposed DNN approach can effectively suppress highly non-stationary noise, which is difficult to handle in general, and deals well with noisy speech recorded in real-world scenarios without generating the annoying musical artifacts commonly observed in conventional enhancement methods.

End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation

This paper proposes transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation based on the filter-and-sum network, and shows how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays.
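The TAC paradigm can be sketched in a few lines (a hedged illustration under assumed shapes; the paper uses learned projections with parametric activations inside a full separation network, while plain ReLU and caller-supplied matrices stand in here):

```python
import numpy as np

def tac(channel_feats, w_transform, w_average, w_concat):
    """Transform-average-concatenate over an arbitrary number of microphones.

    channel_feats: (n_mics, d) per-channel features (any n_mics).
    w_transform: (d, h), w_average: (h, h), w_concat: (2*h, d) — assumed shapes.
    """
    # Transform: a shared projection applied to each channel independently.
    transformed = np.maximum(channel_feats @ w_transform, 0.0)
    # Average: mean-pool across channels, giving an order- and count-invariant summary.
    pooled = np.maximum(transformed.mean(axis=0) @ w_average, 0.0)
    # Concatenate: append the global summary to every channel and project back.
    tiled = np.broadcast_to(pooled, (transformed.shape[0], pooled.shape[0]))
    return np.concatenate([transformed, tiled], axis=1) @ w_concat
```

Because the only cross-channel operation is a mean, the block works unchanged for any microphone count and is equivariant to channel permutations — the properties the title refers to.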

AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

A large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker text-to-speech systems is presented, along with a robust synthesis model able to achieve zero-shot voice cloning.

Unsupervised Neural Mask Estimator for Generalized Eigen-Value Beamforming Based Asr

The ASR results for the proposed approach are significantly better than those of a teacher model trained on an out-of-domain dataset, and on par with oracle mask estimators trained on the in-domain dataset.

Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs

A new model has been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay, known as perceptual evaluation of speech quality (PESQ).