DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

@article{Gogate2018DNNDS,
  title={DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation},
  author={M. Gogate and Ahsan Adeel and R. Marxer and J. Barker and A. Hussain},
  journal={ArXiv},
  year={2018},
  volume={abs/1808.00060}
}
The human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. Selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) based audio-visual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and…
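For readers who want the shape of the idea, the following is a minimal PyTorch sketch of an audio-visual mask estimator in the spirit of the abstract: separate audio and visual encoders, temporal fusion, and a sigmoid time-frequency mask. The class name AVMaskEstimator, all layer sizes, and the feature dimensions are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of an audio-visual mask estimator (illustrative, not the
# paper's exact architecture): audio and visual streams are encoded
# separately, fused, and mapped to a sigmoid time-frequency mask.
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, n_visual=128, hidden=256):
        super().__init__()
        # Audio branch: per-frame noisy log-magnitude spectrum -> embedding.
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        # Visual branch: per-frame lip-region feature vector -> embedding.
        self.visual_enc = nn.Sequential(nn.Linear(n_visual, hidden), nn.ReLU())
        # Temporal context over the fused streams.
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # Sigmoid head: one mask value in [0, 1] per time-frequency unit.
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, visual_feat):
        # noisy_spec: (batch, time, n_freq); visual_feat: (batch, time, n_visual),
        # assumed already upsampled to the audio frame rate.
        fused = torch.cat([self.audio_enc(noisy_spec),
                           self.visual_enc(visual_feat)], dim=-1)
        out, _ = self.rnn(fused)
        return self.mask_head(out)

# Usage: multiply the predicted mask with the noisy magnitude spectrogram.
model = AVMaskEstimator()
noisy = torch.randn(2, 100, 257).abs()   # dummy magnitude frames
visual = torch.randn(2, 100, 128)        # dummy lip embeddings
mask = model(noisy, visual)              # (2, 100, 257), values in [0, 1]
enhanced = mask * noisy
```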
CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement
TLDR
A causal, language-, noise- and speaker-independent AV deep neural network (DNN) architecture for speech enhancement (SE) is presented that exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility.
Audio speech enhancement using masks derived from visual speech
TLDR
Experiments on large unconstrained-vocabulary speech confirm that the model architectures and approaches developed can generalise to unconstrained speech across noise-independent conditions and can be considered for monaural speaker-dependent real-world applications.
Deep Neural Network Driven Binaural Audio Visual Speech Separation
TLDR
A deep neural network (DNN) is presented that ingests the binaural sounds received at the two ears as well as the visual frames to selectively suppress the competing noise sources individually at each ear.
Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoder
TLDR
This paper develops a conditional VAE (CVAE) where the audio speech generative process is conditioned on visual information of the lip region, and it improves speech enhancement performance compared with an audio-only VAE model.
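A rough sketch of the CVAE idea described above, assuming simple fully connected networks: both the encoder q(z | s, v) and the decoder p(s | z, v) receive a lip-region embedding v. The class name AVCVAE, the dimensions, and the squared-error reconstruction term are illustrative assumptions; the paper's actual likelihood and architecture may differ.

```python
# Hedged sketch of an audio-visual conditional VAE: encoder and decoder
# are both conditioned on a visual embedding v. Sizes are illustrative.
import torch
import torch.nn as nn

class AVCVAE(nn.Module):
    def __init__(self, n_freq=257, n_visual=64, n_latent=32, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq + n_visual, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, n_latent)
        self.logvar = nn.Linear(hidden, n_latent)
        self.dec = nn.Sequential(
            nn.Linear(n_latent + n_visual, hidden), nn.Tanh(),
            nn.Linear(hidden, n_freq), nn.Softplus())  # nonnegative output

    def forward(self, speech_pow, visual):
        h = self.enc(torch.cat([speech_pow, visual], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = self.dec(torch.cat([z, visual], dim=-1))
        return recon, mu, logvar

def elbo_loss(recon, target, mu, logvar):
    # Squared-error reconstruction term plus KL divergence to N(0, I);
    # the paper may use a different (e.g. spectrogram-specific) likelihood.
    rec = ((recon - target) ** 2).sum(dim=-1)
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)
    return (rec + kl).mean()

# Usage on dummy frames:
model = AVCVAE()
s = torch.rand(8, 257)    # dummy speech power frames
v = torch.randn(8, 64)    # dummy lip embeddings
recon, mu, logvar = model(s, v)
loss = elbo_loss(recon, s, mu, logvar)
```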
Contextual Audio-Visual Switching For Speech Enhancement in Real-World Environments
TLDR
A novel context-aware AV speech enhancement framework is introduced that contextually exploits AV cues with respect to different operating conditions in order to estimate clean audio, without requiring any prior SNR estimation.
Audio-Visual Speech Enhancement Using Conditional Variational Auto-Encoders
TLDR
This article develops a conditional VAE (CVAE) where the audio speech generative process is conditioned on visual information of the lip region, and it improves speech enhancement performance compared with an audio-only VAE model.
Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-Based Baseline System
TLDR
A first-of-its-kind audio-visual corpus comprising 2500 utterances from 209 speakers, recorded in real noisy environments including social gatherings, streets, cafeterias and restaurants, is presented, along with a baseline deep neural network (DNN) based spectral mask estimation model for speech enhancement.
Variance based time-frequency mask estimation for unsupervised speech enhancement
TLDR
The experimental results showed large improvements in terms of perceptual evaluation of speech quality, segmental SNR, residual noise distortion and speech distortion over those achieved with competing methods at different input SNRs.
Audiovisual Speaker Conversion: Jointly and Simultaneously Transforming Facial Expression and Acoustic Characteristics
TLDR
An audiovisual speaker conversion method that combines facial and acoustic features makes it possible for the converted voice and facial expressions to be highly correlated and for the generated target speaker to appear and sound natural.
Novel Deep Convolutional Neural Network-Based Contextual Recognition of Arabic Handwritten Scripts
TLDR
This paper proposes a supervised Convolutional Neural Network (CNN) model that contextually extracts optimal features and employs batch normalization and dropout regularization parameters to address the challenges of recognizing offline handwritten Arabic text, including isolated digits, characters, and words.

References

DNN Based Mask Estimation for Supervised Speech Separation
TLDR
This chapter introduces deep neural network (DNN) based mask estimation for supervised speech separation, which learns a mapping function from acoustic features extracted from noisy speech to an ideal mask.
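As a concrete reference point, the ideal ratio mask commonly used as the supervised training target can be written down in a few lines of NumPy. The exponent beta = 0.5 is the usual choice, though specific works use variants; this sketch is generic, not the chapter's exact definition.

```python
# Standard ideal-ratio-mask (IRM) construction used as a DNN training
# target, plus mask application at inference time.
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    """IRM(t, f) = (|S|^2 / (|S|^2 + |N|^2))^beta, values in [0, 1]."""
    s2, n2 = clean_mag ** 2, noise_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

def apply_mask(noisy_stft, mask):
    # Scale each time-frequency unit of the noisy STFT; phase stays noisy.
    return mask * noisy_stft

# Toy example on random magnitude spectrograms.
rng = np.random.default_rng(0)
clean = rng.random((100, 257))
noise = rng.random((100, 257))
irm = ideal_ratio_mask(clean, noise)
assert irm.min() >= 0.0 and irm.max() <= 1.0
```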
Long short-term memory for speaker generalization in supervised speech separation.
TLDR
A separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for the temporal dynamics of speech and substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility.
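A minimal PyTorch sketch of such an LSTM mask estimator, assuming log-magnitude input frames and an MSE loss against the ideal mask; the class name LSTMMaskNet and all layer sizes are illustrative, not the paper's configuration.

```python
# Audio-only LSTM mask estimator in the spirit of the description above:
# it maps noisy feature frames to a ratio mask while modelling temporal
# dynamics across frames.
import torch
import torch.nn as nn

class LSTMMaskNet(nn.Module):
    def __init__(self, n_feat=257, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, n_feat), nn.Sigmoid())

    def forward(self, noisy_feats):          # (batch, time, n_feat)
        h, _ = self.lstm(noisy_feats)
        return self.out(h)                   # ratio mask in [0, 1]

# Training pairs a noisy input with its ideal-mask target, e.g. via MSE:
net = LSTMMaskNet()
x = torch.randn(4, 50, 257)                  # dummy noisy features
target = torch.rand(4, 50, 257)              # dummy ideal-mask target
loss = nn.functional.mse_loss(net(x), target)
loss.backward()
```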
Investigation of Speech Separation as a Front-End for Noise Robust Speech Recognition
TLDR
An in-depth evaluation of such techniques as a front-end for noise-robust automatic speech recognition (ASR) is performed, along with a diagonal feature discriminant linear regression (dFDLR) adaptation that can be applied on a per-utterance basis for ASR systems employing deep neural networks and HMMs.
Improving automatic speech recognition in spatially-aware hearing aids
TLDR
The possibility of improving ASR based on binaural hearing aid signals in complex acoustic scenes is investigated using a recently developed method that employs probabilistic information about the location of a target speaker (and a simultaneous localized masker) for robust real-time localization.
Audio-visual Convolutive Blind Source Separation
We present a novel method for separating speech from audio mixtures using audio-visual coherence. It consists of two stages: in the off-line training process, we use the Gaussian mixture…
Ideal ratio mask estimation using deep neural networks for robust speech recognition
  • A. Narayanan, Deliang Wang
  • Computer Science
  • 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2013
TLDR
The proposed feature enhancement algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that have previously been used to estimate the ideal binary mask.
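A sketch of the Mel-domain IRM target described above, using librosa's mel filterbank to pool linear power spectra into mel bands; the paper's smoothing step and feature set are not reproduced, and the sampling rate, FFT size, and band count below are assumptions.

```python
# Mel-domain IRM: pool clean and noise power into mel bands, then form
# the ratio mask per band. Parameters are illustrative defaults.
import numpy as np
import librosa

def mel_domain_irm(clean_pow, noise_pow, sr=16000, n_fft=512, n_mels=64):
    """clean_pow/noise_pow: (frames, 1 + n_fft // 2) linear power spectra."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, bins)
    s_mel = clean_pow @ mel_fb.T      # speech power per mel band
    n_mel = noise_pow @ mel_fb.T      # noise power per mel band
    return np.sqrt(s_mel / (s_mel + n_mel + 1e-12))   # (frames, n_mels)
```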
Neural network based spectral mask estimation for acoustic beamforming
TLDR
A neural network based approach to acoustic beamforming is presented: the network estimates spectral masks from which the cross-power spectral density matrices of speech and noise are computed, and these in turn yield the beamformer coefficients.
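The mask-to-beamformer pipeline can be sketched as follows, with a GEV-style beamformer as one common choice for turning mask-weighted cross-power spectral density (PSD) estimates into coefficients; the diagonal loading and the array shapes are illustrative choices rather than the paper's exact recipe.

```python
# Mask-driven beamforming sketch: masks weight the noisy multichannel STFT
# to estimate speech/noise PSD matrices; a GEV beamformer follows per bin.
import numpy as np
from scipy.linalg import eigh

def psd_matrix(stft, mask):
    """stft: (channels, frames, bins) complex; mask: (frames, bins) in [0, 1]."""
    C, T, F = stft.shape
    psd = np.zeros((F, C, C), dtype=complex)
    for f in range(F):
        y = stft[:, :, f]                       # (C, T)
        w = mask[:, f]                          # (T,)
        # Mask-weighted covariance of the channel vectors at this bin.
        psd[f] = (w * y) @ y.conj().T / max(w.sum(), 1e-8)
    return psd

def gev_beamformer(psd_speech, psd_noise):
    """Per-bin principal generalized eigenvector of (Phi_ss, Phi_nn)."""
    F, C, _ = psd_speech.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        # eigh solves Phi_ss v = lam * Phi_nn v; eigenvalues ascend, so the
        # last column is the maximum-SNR direction. Diagonal loading keeps
        # the noise PSD positive definite.
        vals, vecs = eigh(psd_speech[f], psd_noise[f] + 1e-6 * np.eye(C))
        w[f] = vecs[:, -1]
    return w

# Beamformed output per unit: x_hat(t, f) = w(f)^H y(t, f).
```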
A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation
TLDR
This work proposes and compares perceptually motivated loss functions for deep learning based binary mask estimation for speech separation that aim to maximise the hit minus false-alarm (HIT-FA) rate, which is known to correlate more closely with speech intelligibility.
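For reference, the HIT-FA criterion is straightforward to compute once an ideal binary mask (IBM) is defined; the -6 dB local criterion below is a common but assumed threshold, not necessarily the one used in this work.

```python
# HIT-FA for binary mask estimation: HIT is the fraction of target-dominant
# units correctly kept, FA the fraction of noise-dominant units wrongly
# kept; HIT - FA correlates with intelligibility.
import numpy as np

def ideal_binary_mask(clean_pow, noise_pow, lc_db=-6.0):
    """IBM(t, f) = 1 where local SNR exceeds the local criterion (in dB)."""
    snr_db = 10.0 * np.log10((clean_pow + 1e-12) / (noise_pow + 1e-12))
    return (snr_db > lc_db).astype(float)

def hit_fa(est_mask, ibm):
    hit = (est_mask[ibm == 1] == 1).mean()   # speech units correctly retained
    fa = (est_mask[ibm == 0] == 1).mean()    # noise units wrongly retained
    return hit - fa
```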
Perceptual learning for speech in noise after application of binary time-frequency masks.
TLDR
The study reported here assessed the effect of training on the recognition of speech in noise after processing by ideal time-frequency (TF) masks that did not restore perfect speech intelligibility.
Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises.
TLDR
The results indicate that DNN-based supervised speech segregation with large-scale training is a very promising approach for generalization to new acoustic environments.