Weighting Time-Frequency Representation of Speech Using Auditory Saliency for Automatic Speech Recognition

Cong-Thanh Do and Yannis Stylianou
This paper proposes a new method for weighting the two-dimensional (2D) time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR). Auditory saliency is estimated via 2D auditory saliency maps, which model the mechanism by which human auditory attention is allocated. These maps are used to weight the T-F representation of speech, namely the 2D magnitude spectrum or spectrogram, prior to feature extraction for ASR. Experiments on the Aurora-4 corpus… 
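The core idea summarized in the abstract can be sketched as an element-wise weighting of the magnitude spectrogram by a saliency map. The sketch below assumes the saliency map is already computed and aligned to the spectrogram grid; the function name, frame sizes, and interface are illustrative, not the paper's exact implementation.

```python
import numpy as np
from scipy.signal import stft

def saliency_weighted_spectrogram(x, fs, saliency_map):
    """Weight a 2D magnitude spectrogram by an auditory saliency map
    before feature extraction. Minimal sketch of the paper's idea;
    the saliency map itself is assumed to be given and to have the
    same shape as the spectrogram."""
    # 25 ms frames with 10 ms hop at fs = 16 kHz (illustrative choice)
    _, _, Z = stft(x, fs=fs, nperseg=400, noverlap=240)
    S = np.abs(Z)                        # 2D magnitude spectrum (T-F)
    assert saliency_map.shape == S.shape
    return S * saliency_map              # element-wise T-F weighting
```

Weighted magnitudes would then feed a standard front-end (e.g. filter-bank or cepstral features) in place of the raw spectrogram.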
1 Citation


Development of a mathematical model of scrambler-type speech-like interference generator for system of prevent speech information from leaking via acoustic and vibration channels
Results indicate the high efficiency of the proposed method of protecting speech information, which takes into account the use of dynamic keys for coding systems, the connection of third-party sources of speech signals, and ringing at the input of the scrambling unit.


References

Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition.
Gabor features are shown to be more robust against extrinsic variation than the baseline systems without CMS, with relative improvements of 28% and 16% for two training conditions (using only clean training samples or a mixture of noisy and clean utterances, respectively).
Improved Automatic Speech Recognition Using Subband Temporal Envelope Features and Time-Delay Neural Network Denoising Autoencoder
This paper investigates the use of perceptually-motivated subband temporal envelope (STE) features and a time-delay neural network (TDNN) denoising autoencoder (DAE) to improve deep neural network…
A Spectral Masking Approach to Noise-Robust Speech Recognition Using Deep Neural Networks
  • Bo Li, K. Sim
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2014
A robust spectral masking system where power spectral domain masks are predicted using a DNN trained on the same filter-bank features used for acoustic modeling is proposed, motivated by the separation-prior-to-recognition process of the human auditory system.
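The spectral-masking step described in this snippet amounts to multiplying a noisy power spectrogram by a soft mask in [0, 1] (here assumed to come from a DNN). The function below is a generic sketch of that application step, not the cited system; the flooring constant is an illustrative choice to avoid over-suppression.

```python
import numpy as np

def apply_spectral_mask(power_spec, mask, floor=1e-3):
    """Apply a soft T-F mask in [0, 1] (e.g. predicted by a DNN) to a
    noisy power spectrogram. Generic sketch of the masking step; the
    floor keeps T-F units from being zeroed out entirely."""
    mask = np.clip(mask, floor, 1.0)   # limit attenuation per T-F unit
    return power_spec * mask           # element-wise enhancement
```

The enhanced power spectrogram would then pass through the same filter-bank front-end used for acoustic modeling.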
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition
  • Chanwoo Kim, R. Stern
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2016
Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments, with only slightly greater computational cost than conventional MFCC processing.
Localized spectro-temporal cepstral analysis of speech
A novel speech feature analysis technique based on localized spectro-temporal cepstral analysis of speech is presented; it is more robust to noise and better captures the temporal modulations important for recognizing plosive sounds.
A novel approach to soft-mask estimation and Log-Spectral enhancement for robust speech recognition
  • Julien van Hout, A. Alwan
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
Evaluation on the Aurora-2 corpus shows that the proposed approach competes with state-of-the-art front-ends, like ETSI-AFE, MVA or PNCC.
A saliency-based auditory attention model with applications to unsupervised prominent syllable detection in speech
A novel biologically plausible auditory saliency map is presented to model such saliency-based auditory attention, and its usefulness in detecting prominent syllable and word locations in speech is tested in an unsupervised manner.
Multi-stream spectro-temporal features for robust speech recognition
When used in combination with MFCCs for speech recognition under both clean and noisy conditions, multi-stream spectro-temporal features provide roughly a 30% relative improvement in word-error rate over using MFCCs alone.
Perceptual linear predictive (PLP) analysis of speech.
  • H. Hermansky
  • Physics
    The Journal of the Acoustical Society of America
  • 1990
A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.
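One of PLP's three psychoacoustic steps, the intensity-loudness power law, is commonly approximated by cube-root compression of critical-band energies. The sketch below shows only that step; critical-band integration and equal-loudness pre-emphasis (the other two concepts) are assumed to happen upstream.

```python
import numpy as np

def plp_loudness_compression(critical_band_power):
    """Intensity-loudness power law from PLP analysis, approximated
    as cube-root compression of critical-band energies. Isolated
    sketch of one step in the PLP pipeline, not the full analysis."""
    return np.cbrt(critical_band_power)
```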
The role of binary mask patterns in automatic speech recognition in background noise.
The first study that investigates the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes indicates that maximizing the SNR gain is probably not an appropriate goal for improving either human or machine recognition of noisy speech.
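The binary mask pattern discussed above can be illustrated with the standard ideal-binary-mask definition: a T-F unit is retained when its local SNR exceeds a local criterion. This is a minimal sketch of that general idea, assuming oracle access to separate speech and noise power spectrograms; it is not the cited study's exact setup.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Ideal binary mask: keep (1) a T-F unit whose local SNR exceeds
    the local criterion lc_db, else discard (0). Oracle speech and
    noise power spectrograms are assumed available."""
    snr_db = 10.0 * np.log10(speech_power / np.maximum(noise_power, 1e-12))
    return (snr_db > lc_db).astype(np.float32)
```

Varying `lc_db` changes how aggressively noisy T-F units are discarded, which relates to the snippet's point that maximizing SNR gain alone may not maximize recognition accuracy.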