Perceptual Properties of Current Speech Recognition Technology

  • Hynek Hermansky, Jordan Cohen, Richard M. Stern
    Proceedings of the IEEE
In recent years, a number of feature extraction procedures for automatic speech recognition (ASR) systems have been based on models of human auditory processing, and one often hears arguments in favor of implementing knowledge of human auditory perception and cognition into machines for ASR. This paper takes a reverse route, and argues that the engineering techniques for automatic recognition of speech that are already in widespread use are often consistent with some well-known properties of… 

On the Effect of the Implementation of Human Auditory Systems on Q-Log-Based Features for Robustness of Speech Recognition Against Noise

Mimicking human auditory systems as well as applying mean normalization in feature extraction are widely believed to improve the robustness of speech recognition. Traditionally, the normalization is


It is hypothesized that the linguistic message in speech, represented as a string of speech sounds, is coded redundantly in both the time and frequency domains, so that relevant spectral and temporal properties of human hearing can be used to extract the message from a noisy speech signal.

Sub-band Autoencoder features for Automatic Speech Recognition

A system combination of Mel-filterbank energies and sub-band autoencoder (SBAE) features, used to train a DNN, provides complementary information present in the speech signal that helps representation learning.

Weighting Time-Frequency Representation of Speech Using Auditory Saliency for Automatic Speech Recognition

A new method for weighting the two-dimensional (2D) time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR) reduces the relative word error rate (WER) in multi-stream ASR.

Unsupervised Deep Auditory Model Using Stack of Convolutional RBMs for Speech Recognition

The proposed two-layer Unsupervised Deep Auditory Model (UDAM), built by stacking two ConvRBMs, improves speech recognition performance over Mel filterbank features; further improvements can be achieved by system-level combination of UDAM features and Mel filterbank features.

Non-Intrusive Estimation of Speech Signal Parameters using a Frame-based Machine Learning Approach

A novel, non-intrusive method is presented that jointly estimates acoustic signal properties associated with the perceptual speech quality, level of reverberation and noise in a speech signal and shows how each type of acoustic parameter correlates with ASR performance in terms of ground truth labels.

Filterbank learning using Convolutional Restricted Boltzmann Machine for speech recognition

  • Hardik B. Sailor, H. A. Patil
  • Computer Science
    2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2016
The developed ConvRBM, with sampling from noisy rectified linear units (NReLUs), is trained in an unsupervised way to model speech signals of arbitrary length, and the weights of the model can represent an auditory-like filterbank.

Speech Enhancement using a Deep Mixture of Experts

The proposed Deep Mixture of Experts scheme outperforms other schemes that either do not consider phoneme structure or use a simpler training methodology in the task of speech enhancement.

Non-Intrusive POLQA Estimation of Speech Quality using Recurrent Neural Networks

A novel, non-intrusive estimator is presented that exploits recurrent neural network architectures to predict the intrusive POLQA score of a speech signal in a short time context, based on a novel compressed representation of modulation domain features.



Application of an auditory model to speech recognition.

  • J. Cohen
  • Computer Science
    The Journal of the Acoustical Society of America
  • 1989
A new process includes adaptation, loudness scaling, and mel warping in a front end for the IBM speech-recognition system, and tests show that the design is an improvement over previous algorithms.

Analysis of physiologically-motivated signal processing for robust speech recognition

It is shown that feature extraction based on auditory processing provides better performance in the presence of additive background noise than traditional MFCC processing and it is argued that an expansive nonlinearity in the auditory model contributes the most to noise robustness.

Features Based on Auditory Physiology and Perception

  • R. Stern, N. Morgan
  • Physics
    Techniques for Noise Robustness in Automatic Speech Recognition
  • 2012
The goal of this chapter is to review some of the major ways in which feature extraction schemes based on auditory processing have facilitated greater speech recognition accuracy in recent years, as well as to provide some insight into the nature of current trends and future directions in this area.

Nonlinear enhancement of onset for robust speech recognition

A novel algorithm, Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF), is presented to enhance spectral features for robust speech recognition, especially in reverberant environments.

Should recognizers have ears?

A model of auditory perception as front end for automatic speech recognition.

A front end for automatic speech recognizers is proposed and evaluated which is based on a quantitative model of the "effective" peripheral auditory processing. The model simulates both spectral and

A Performance Monitoring Approach to Fusing Enhanced Spectrogram Channels in Robust Speech Recognition

An implementation of a performance monitoring approach to feature channel integration in robust automatic speech recognition is presented. Motivated by psychophysical evidence in human speech

Physiologically-motivated synchrony-based processing for robust automatic speech recognition

It is shown that the use of physiologically-motivated peripheral processing improves recognition accuracy in the presence of both broadband and transient noise, and that the use of the synchrony mechanism provides further improvement beyond that provided by the mean rate mechanism.

Continuous speech recognition by statistical methods

  • F. Jelinek
  • Computer Science
    Proceedings of the IEEE
  • 1976
Experimental results are presented that indicate the power of the methods and concern modeling of a speaker and of an acoustic processor, extraction of the models' statistical parameters and hypothesis search procedures and likelihood computations of linguistic decoding.

Auditory processing of speech signals for robust speech recognition in real-world noisy environments

This paper presents a new approach to an auditory model for robust speech recognition in noisy environments. The proposed model consists of cochlear bandpass filters and nonlinear operations in which