Lexicon-Free Conversational Speech Recognition with Neural Networks

@inproceedings{Maas2015LexiconFreeCS,
  title={Lexicon-Free Conversational Speech Recognition with Neural Networks},
  author={Andrew L. Maas and Ziang Xie and Dan Jurafsky and Andrew Y. Ng},
  booktitle={NAACL},
  year={2015}
}
We present an approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure. This approach eliminates much of the complex infrastructure of modern speech recognition systems, making it possible to directly train a speech recognizer using errors generated by spoken language understanding tasks. The system naturally handles out-of-vocabulary words and spoken word fragments. We demonstrate…
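To make the decoding procedure concrete, below is a minimal, illustrative Python sketch of a CTC-style prefix beam search that combines per-frame character probabilities with a character-level language model score and a character insertion bonus. The function names, the `alpha`/`beta` weighting, and the `lm_logprob` interface are assumptions for illustration rather than the authors' released code, and the sketch omits several practical details of a production decoder.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*args):
    """Stable log(sum(exp(args))) for accumulating log-domain probabilities."""
    if all(a == NEG_INF for a in args):
        return NEG_INF
    m = max(args)
    return m + math.log(sum(math.exp(a - m) for a in args))

def prefix_beam_search(log_probs, alphabet, lm_logprob, alpha=0.5, beta=1.5, beam=25):
    """Simplified CTC prefix beam search with a character language model.

    log_probs  : T x (len(alphabet)+1) per-frame log character probabilities,
                 with column 0 reserved for the CTC blank.
    lm_logprob : function (prefix_tuple, next_char) -> log P(next_char | prefix).
    alpha, beta: character LM weight and insertion bonus (illustrative defaults,
                 not tuned values from the paper).
    """
    # Each hypothesis maps a character prefix to a pair of scores:
    # (log prob of paths ending in blank, log prob of paths ending in a character).
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            # Case 1: emit blank -> prefix unchanged, mass moves to "ends in blank".
            b, nb = next_beams[prefix]
            next_beams[prefix] = (logsumexp(b, p_b + frame[0], p_nb + frame[0]), nb)
            # Case 2: emit a character from the alphabet.
            for i, ch in enumerate(alphabet, start=1):
                p = frame[i]
                new_prefix = prefix + (ch,)
                lm = alpha * lm_logprob(prefix, ch) + beta
                b, nb = next_beams[new_prefix]
                if prefix and ch == prefix[-1]:
                    # Repeated character: extending the prefix requires an
                    # intervening blank; otherwise the repeat collapses and the
                    # prefix stays the same.
                    next_beams[new_prefix] = (b, logsumexp(nb, p_b + p + lm))
                    sb, snb = next_beams[prefix]
                    next_beams[prefix] = (sb, logsumexp(snb, p_nb + p))
                else:
                    next_beams[new_prefix] = (b, logsumexp(nb, p_b + p + lm, p_nb + p + lm))
        # Keep only the `beam` highest-scoring prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam])
    best_prefix, _ = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))
    return "".join(best_prefix)
```

Because the search operates over raw character sequences, the decoder is free to hypothesize strings outside any fixed vocabulary, which is what lets the system produce out-of-vocabulary words and spoken word fragments.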


Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional…
Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition
TLDR: A joint word-character A2W model that learns to first spell the word and then recognize it, and provides a rich output to the user instead of simple word hypotheses, making it especially useful in the case of words unseen or rarely seen during training.
Direct Acoustics-to-Word Models for English Conversational Speech Recognition
TLDR: This paper presents the first results employing direct acoustics-to-word CTC models on two well-known public benchmark tasks, Switchboard and CallHome, presents rescoring results on CTC word model lattices to quantify the performance benefits of an LM, and contrasts the performance of word and phone CTC models.
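The acoustics-to-word idea described in these entries can be sketched in a few lines of PyTorch: an acoustic encoder emits per-frame posteriors over a word vocabulary plus a blank symbol, and a CTC loss aligns them with the reference word sequence. The architecture, sizes, and tensors below are hypothetical placeholders, not the cited authors' configuration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 10k-word output vocabulary plus the CTC blank at index 0.
vocab_size = 10_000
num_classes = vocab_size + 1
feat_dim, hidden = 40, 256

class WordCTCModel(nn.Module):
    """Small BLSTM acoustic model emitting per-frame word posteriors (toy example)."""
    def __init__(self):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)
        return self.out(h).log_softmax(-1)     # (batch, frames, classes)

model = WordCTCModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Toy batch: two utterances of 200 frames with references of 7 and 5 words.
feats = torch.randn(2, 200, feat_dim)
targets = torch.randint(1, num_classes, (12,))   # concatenated word IDs (7 + 5)
input_lengths = torch.tensor([200, 200])
target_lengths = torch.tensor([7, 5])

log_probs = model(feats).transpose(0, 1)         # CTCLoss expects (frames, batch, classes)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```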
Word-level Speech Recognition with a Dynamic Lexicon
TLDR: It is shown that the direct-to-word model can achieve word error rate gains over sub-word-level models for speech recognition, and that the word-level embeddings the authors learn contain significant acoustic information, making them more suitable for use in speech recognition.
Building DNN acoustic models for large vocabulary speech recognition
An Exploration of Directly Using Word as Acoustic Modeling Unit for Speech Recognition
TLDR: This study systematically explores using whole words as the acoustic modeling unit for conversational speech recognition, replacing senone alignment with word alignment in a convolutional bidirectional LSTM architecture and employing lexicon-free weighted finite-state transducer (WFST) based decoding, which greatly simplifies the conventional hybrid speech recognition system.
Character-level incremental speech recognition with recurrent neural networks
  • Kyuyeon Hwang, Wonyong Sung · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016
TLDR: This work proposes tree-based online beam search with additional depth pruning, which enables the system to process arbitrarily long input speech with low latency; it not only responds quickly to speech but can also transcribe out-of-vocabulary (OOV) words according to their pronunciation.
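The depth-pruning idea mentioned above can be illustrated with a toy helper that drops hypotheses whose character prefix lags far behind the deepest one in the beam, forcing an online decoder to commit to output quickly. This is only a sketch of the intuition, not the exact tree-based scheme of the cited paper; the `max_lag` parameter and data are made up.

```python
def depth_prune(beams, max_lag=4):
    """Toy illustration of depth pruning for online beam search.

    Hypotheses whose character prefix is much shorter than the longest prefix
    in the beam are discarded, so the decoder emits output with low latency.
    `beams` maps a character prefix (tuple of chars) to its log score.
    """
    deepest = max(len(prefix) for prefix in beams)
    return {prefix: score for prefix, score in beams.items()
            if len(prefix) >= deepest - max_lag}

# Example: the 3-character hypothesis lags the 9-character one by more than 4
# characters and is dropped; the other two survive.
beams = {tuple("hello the"): -3.2, tuple("hello th"): -3.5, tuple("hel"): -2.9}
print(depth_prune(beams))
```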
End to End Speech Recognition System
TLDR: An end-to-end speech recognition system that directly transcribes audio data to text/phonemes is described; it replaces the conventional speech recognition pipeline with a single recurrent neural network (RNN) architecture that combines a deep bidirectional LSTM with the Connectionist Temporal Classification (CTC) objective function.
Zero-shot Learning for Speech Recognition with Universal Phonetic Model
TLDR: This work addresses the problem of building an acoustic model for languages with zero audio resources by adopting the idea of zero-shot learning and decomposing phonemes into corresponding phonetic attributes such as vowel and consonant.
Who Needs Words? Lexicon-Free Speech Recognition
TLDR: This paper shows that character-based language models (LMs) can perform as well as word-based LMs for speech recognition, in terms of word error rate (WER), even without restricting the decoding to a lexicon.
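One practical question behind this comparison is how to put character-level and word-level LMs on the same scale. A common trick, shown below with hypothetical numbers, is to convert a character LM's total log-probability on a test set into a per-word perplexity.

```python
import math

def word_perplexity_from_char_lm(total_char_logprob_nats, num_words):
    """Convert a character LM's total log-probability over a test set into a
    word-level perplexity, so character and word LMs can be compared directly."""
    return math.exp(-total_char_logprob_nats / num_words)

# Hypothetical numbers: 1,000,000 characters at an average of -0.83 nats each,
# spanning 180,000 words, gives a word-level perplexity of roughly 100.
print(word_perplexity_from_char_lm(-0.83 * 1_000_000, 180_000))
```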

References

Showing 1-10 of 23 references
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…
Large-Vocabulary Continuous Speech Recognition Systems: A Look at Some Recent Advances
TLDR: The aim of this article is to describe some of the technological underpinnings of modern LVCSR systems, which are not robust to mismatched training and test conditions and cannot handle context as well as human listeners, despite being trained on thousands of hours of speech and billions of words of text.
Recurrent neural network based language model
TLDR: Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.
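A minimal sketch of the mixture idea: interpolate per-token probabilities from several LMs with fixed weights and compute the perplexity of the combination. The probabilities and weights below are made-up toy values.

```python
import math

def mixture_perplexity(token_probs_per_model, weights):
    """Perplexity of a linearly interpolated mixture of language models.

    token_probs_per_model: one list per model, each holding P(w_i | history)
                           for every token of the evaluation text.
    weights: interpolation weights summing to 1.
    """
    num_tokens = len(token_probs_per_model[0])
    log_prob = 0.0
    for i in range(num_tokens):
        p = sum(w * probs[i] for w, probs in zip(weights, token_probs_per_model))
        log_prob += math.log(p)
    return math.exp(-log_prob / num_tokens)

# Toy example: two RNN LMs interpolated with a backoff n-gram LM on a 4-token text.
probs = [
    [0.10, 0.20, 0.05, 0.30],   # RNN LM 1
    [0.08, 0.25, 0.04, 0.28],   # RNN LM 2
    [0.05, 0.10, 0.02, 0.20],   # backoff n-gram LM
]
print(mixture_perplexity(probs, weights=[0.4, 0.4, 0.2]))
```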
Deep Neural Networks for Acoustic Modeling in Speech Recognition
TLDR: This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
Connectionist Speech Recognition: A Hybrid Approach
From the Publisher: Connectionist Speech Recognition: A Hybrid Approach describes the theory and implementation of a method to incorporate neural network approaches into state-of-the-art continuous…
On rectified linear units for speech processing
TLDR: This work shows that substituting the logistic units with rectified linear units can improve generalization and make training of deep networks faster and simpler.
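The contrast between the two unit types can be seen directly from their gradients: the logistic unit saturates for large pre-activations while the rectified linear unit does not, which is one reason ReLU networks train faster. A tiny numerical illustration:

```python
import math

def logistic(x):
    """Saturating logistic (sigmoid) unit."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: identity for positive inputs, zero otherwise."""
    return max(0.0, x)

# Gradients at a large pre-activation: the logistic gradient has nearly
# vanished, while the ReLU gradient is still 1.
x = 8.0
logistic_grad = logistic(x) * (1.0 - logistic(x))   # ~3e-4
relu_grad = 1.0 if x > 0 else 0.0                    # 1.0
print(relu(x), logistic_grad, relu_grad)
```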
The Kaldi Speech Recognition Toolkit
TLDR: The design of Kaldi, a free, open-source toolkit for speech recognition research, is described; it provides a speech recognition system based on finite-state transducers together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Large vocabulary continuous speech recognition with context-dependent DBN-HMMs
TLDR: This work proposes a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large-vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task.
Improving deep neural networks for LVCSR using rectified linear units and dropout
TLDR: Modelling deep neural networks with rectified linear unit (ReLU) non-linearities, with minimal human hyper-parameter tuning, on a 50-hour English Broadcast News task shows a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system.