Developments and directions in speech recognition and understanding, Part 1 [DSP Education]

  • J. Baker, Li Deng, James R. Glass, Sanjeev Khudanpur, Chin-Hui Lee, Nelson Morgan, Douglas D. O'Shaughnessy
  • IEEE Signal Processing Magazine
To advance research, it is important to identify promising future research directions, especially those that have not been adequately pursued or funded in the past. The working group producing this article was charged to elicit from the human language technology (HLT) community a set of well-considered directions or rich areas for future research that could lead to major paradigm shifts in the field of automatic speech recognition (ASR) and understanding. ASR has been an area of great interest… 

Deep learning: from speech recognition to language and multimodal processing

  • L. Deng
  • Computer Science
    APSIPA Transactions on Signal and Information Processing
  • 2016
This article reflects on the historical path to the transformative success of deep learning in speech recognition, discusses a number of key issues in deep learning, and analyzes future directions for perceptual tasks such as speech, image, and video, as well as for cognitive tasks involving natural language.

A comparative study of state-of-the-art speech recognition models for English and Dutch

It can be deduced that dataset size influences the accuracy of speech recognition systems, and that the listen, attend and spell model outperforms the CNN-BLSTM model on both the English and Dutch datasets.

Automatic Speech Recognition using limited vocabulary: A survey

A comprehensive view of mechanisms behind ASR systems as well as techniques, tools, projects, recent contributions, and possibly future directions in ASR using a limited vocabulary is provided.

Large-Vocabulary Continuous Speech Recognition Systems: A Look at Some Recent Advances

The aim of this article is to describe some of the technological underpinnings of modern LVCSR systems, which are not robust to mismatched training and test conditions and cannot handle context as well as human listeners despite being trained on thousands of hours of speech and billions of words of text.

Machine Learning in Automatic Speech Recognition: A Survey

A comprehensive review of common machine learning techniques like artificial neural networks, support vector machines, and Gaussian mixture models along with hidden Markov models employed in ASR is provided.

Machine Learning Paradigms for Speech Recognition: An Overview

  • L. Deng, Xiao Li
  • Computer Science
    IEEE Transactions on Audio, Speech, and Language Processing
  • 2013
This article provides readers with an overview of modern ML techniques as utilized in current, and relevant to future, ASR research and systems, and presents and analyzes recent developments in deep learning and learning with sparse representations.


The main objective of this thesis is to develop an efficient speech recognition system for recognising speaker-independent isolated words in Malayalam, using the two feature extraction techniques that produced the best recognition accuracy: Discrete Wavelet Transforms and Wavelet Packet Decomposition.

Spoken Language Processing: Where Do We Go from Here?

This chapter shows how the growing evidence for an intimate relationship between sensor and motor behaviour in living organisms, the power of negative feedback control to accommodate unpredictable disturbances in real-world environments, and hierarchical models of temporal memory point towards a novel architecture for speech-based human-machine interaction.

Toward growing modular deep neural networks for continuous speech recognition

A growing modular deep neural network for speech recognition is introduced that is pre-trained in a special manner to capture the spatiotemporal information of the input frame sequences and their labels at the output layer simultaneously.

Stacked transformations for foreign accented speech recognition

The novelty of this work lies in the stack-wise combination of multiple different adaptation transformations that better fit the recognition utterances, termed Stacked Transformations.



Spoken Language Digital Libraries : The Million Hour Speech Project

The Center for Innovations in Speech and Language at Carnegie Mellon University has launched a grand challenge project to collect and annotate at least one million hours of recorded speech to support research into innovative methodologies in knowledge representation and knowledge acquisition applied to speech recognition and synthesis.

Automatic Speech and Speaker Recognition: Advanced Topics

Automatic Speech and Speaker Recognition: Advanced Topics groups together in a single volume a number of topics on speech and speaker recognition that are of fundamental importance but not yet covered in detail in existing textbooks.

RASTA processing of speech

The theoretical and experimental foundations of the RASTA method are reviewed, the relationship with human auditory perception is discussed, the original method is extended to combinations of additive noise and convolutional noise, and an application is shown to speech enhancement.
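
The core idea can be sketched in a few lines: each log spectral channel's time trajectory is band-pass filtered so that slowly varying (e.g. convolutional channel) components are suppressed. The sketch below, under the assumption of the classic filter taps 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴) with a single pole near 1, is illustrative only, not a faithful reimplementation of the full RASTA pipeline.

```python
def rasta_filter(trajectory):
    """Band-pass filter one log-spectral channel's time trajectory.

    Illustrative RASTA-style IIR filter: numerator taps 0.1*[2,1,0,-1,-2],
    denominator 1 - 0.98 z^-1. A constant (convolutional) component is
    driven to zero because the numerator taps sum to zero.
    """
    num = [0.2, 0.1, 0.0, -0.1, -0.2]  # feed-forward taps
    pole = 0.98                         # feedback coefficient
    out = []
    x_hist = [0.0] * len(num)           # recent input samples, newest first
    y_prev = 0.0
    for x in trajectory:
        x_hist = [x] + x_hist[:-1]
        y = sum(b * xi for b, xi in zip(num, x_hist)) + pole * y_prev
        out.append(y)
        y_prev = y
    return out

# A constant channel value (a stationary convolutional distortion in the
# log domain) decays toward zero after the filter's transient.
steady = rasta_filter([1.0] * 200)
```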

fMPE: discriminatively trained features for speech recognition

MPE (minimum phone error) is a previously introduced technique for discriminative training of HMM parameters. fMPE applies the same objective function to the features, transforming the data with a…

Two-channel speech analysis

It is shown how the EGG can be used as a tool for validating speech processing algorithms and estimating possible lower bounds for both computation and performance of these algorithms, particularly closed-phase speech analysis.

Speech and language processing - an introduction to natural language processing, computational linguistics, and speech recognition

This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora, to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation.

An introduction to computing with neural nets

This paper provides an introduction to the field of artificial neural nets by reviewing six important neural net models that can be used for pattern classification and exploring how some existing classification and clustering algorithms can be performed using simple neuron-like components.
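
The "simple neuron-like components" the paper builds on can be illustrated with a single perceptron doing binary pattern classification; the toy OR-gate data, learning rate, and epoch count below are invented for illustration.

```python
def step(x):
    """Threshold activation of a single neuron-like unit."""
    return 1 if x >= 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Perceptron learning rule: nudge weights toward each misclassified target."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), t in zip(samples, labels):
            y = step(w[0] * x1 + w[1] * x2 + b)
            err = t - y
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Linearly separable toy problem: the OR gate.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 1, 1, 1]
w, b = train_perceptron(X, T)
preds = [step(w[0] * x1 + w[1] * x2 + b) for x1, x2 in X]  # → [0, 1, 1, 1]
```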

Rapid speaker adaptation in eigenvoice space

A new model-based speaker adaptation algorithm called the eigenvoice approach, which constrains the adapted model to be a linear combination of a small number of basis vectors obtained offline from a set of reference speakers, and thus greatly reduces the number of free parameters to be estimated from adaptation data.
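
The constraint at the heart of the approach can be sketched directly: the adapted model "supervector" is the speaker-independent mean plus a weighted sum of a few eigenvoice basis vectors, so only the handful of weights must be estimated from adaptation data. The mean, basis vectors, and weights below are invented toy values; real systems derive the basis offline from reference speakers and estimate the weights (e.g. by maximum likelihood) from the new speaker's data.

```python
def adapt(mean_supervector, eigenvoices, weights):
    """Return mean + sum_k weights[k] * eigenvoices[k], per dimension."""
    adapted = list(mean_supervector)
    for w, ev in zip(weights, eigenvoices):
        for i, v in enumerate(ev):
            adapted[i] += w * v
    return adapted

mean = [1.0, 2.0, 3.0, 4.0]            # speaker-independent mean supervector
basis = [[1.0, 0.0, -1.0, 0.0],        # eigenvoice 1
         [0.0, 1.0, 0.0, -1.0]]        # eigenvoice 2
weights = [0.5, -0.25]                 # only 2 free parameters to estimate
model = adapt(mean, basis, weights)    # → [1.5, 1.75, 2.5, 4.25]
```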

Perceptual linear predictive (PLP) analysis of speech.

  • H. Hermansky
  • Physics
    The Journal of the Acoustical Society of America
  • 1990
A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.

A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)

  • J. Fiscus
  • Computer Science
    1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings
  • 1997
A post-recognition process which models the output generated by multiple ASR systems as independent knowledge sources that can be combined and used to generate an output with reduced error rate.
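
The voting step can be sketched as follows: pick the most frequent word at each aligned position across the systems' outputs. The real ROVER first builds the alignment itself (a word transition network) via dynamic programming; here the hypotheses are assumed pre-aligned, with "" marking a null (deletion) slot, and the example sentences are invented.

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Majority vote over word-aligned hypotheses from multiple recognizers."""
    output = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:                       # drop winning null transitions
            output.append(word)
    return output

hyps = [["the", "cat", "sat", ""],
        ["the", "cat", "sat", "down"],
        ["a",   "cat", "mat", "down"]]
print(" ".join(rover_vote(hyps)))      # → the cat sat down
```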