Corpus ID: 16300834

Recognition of spontaneous conversational speech using long short-term memory phoneme predictions

@inproceedings{Wllmer2010RecognitionOS,
  title={Recognition of spontaneous conversational speech using long short-term memory phoneme predictions},
  author={Martin W{\"o}llmer and Florian Eyben and Bj{\"o}rn Schuller and Gerhard Rigoll},
  booktitle={INTERSPEECH},
  year={2010}
}
We present a novel continuous speech recognition framework designed to unite the principles of triphone and Long Short-Term Memory (LSTM) modeling. The LSTM principle allows a recurrent neural network to store and retrieve information over long time periods, which has been shown to be well suited to modeling co-articulation effects in human speech. Our system uses a bidirectional LSTM network to generate a phoneme prediction feature that is observed by a triphone-based large-vocabulary… 
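The core idea of the abstract — a bidirectional LSTM producing a framewise phoneme prediction that is appended to the standard acoustic features as an extra observation for the triphone decoder — can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the dimensions (39 acoustic features, 41 phoneme classes, 128 hidden units) and the PyTorch model are assumptions for the example.

```python
import torch
import torch.nn as nn

class BLSTMPhonemePredictor(nn.Module):
    """Framewise phoneme posterior estimator built on a bidirectional LSTM (sketch)."""
    def __init__(self, n_features=39, n_hidden=128, n_phonemes=41):
        super().__init__()
        self.blstm = nn.LSTM(n_features, n_hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_phonemes)  # forward + backward states

    def forward(self, x):            # x: (batch, frames, n_features)
        h, _ = self.blstm(x)         # h: (batch, frames, 2 * n_hidden)
        return self.out(h).log_softmax(dim=-1)  # framewise log-posteriors

model = BLSTMPhonemePredictor()
mfcc = torch.randn(1, 100, 39)                    # 100 frames of 39-dim features
posteriors = model(mfcc)                          # (1, 100, 41)
# Discrete phoneme prediction used as one extra feature dimension,
# appended to the acoustic features for the triphone-based decoder.
pred = posteriors.argmax(dim=-1, keepdim=True)
tandem = torch.cat([mfcc, pred.float()], dim=-1)  # (1, 100, 40) observation vector
```

In the paper's framework the decoder is a triphone-based HMM system; here the concatenated `tandem` tensor simply stands in for the enlarged observation it would receive per frame.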

Figures and Tables from this paper

A novel bottleneck-BLSTM front-end for feature-level context modeling in conversational speech recognition
We present a novel automatic speech recognition (ASR) front-end that unites Long Short-Term Memory context modeling, bidirectional speech processing, and bottleneck (BN) networks for enhanced Tandem…

A multi-stream ASR framework for BLSTM modeling of conversational speech
This paper extends the principle of joint BLSTM and triphone modeling to a multi-stream system which uses MFCC features and BLSTM predictions as observations originating from two independent data streams, and shows that this technique prevails over a recently proposed single-stream Tandem system as well as over a conventional HMM recognizer.

Feature Frame Stacking in RNN-Based Tandem ASR Systems - Learned vs. Predefined Context
Empirical evidence is provided for the intuition that BLSTM networks make frame stacking redundant, while RNNs profit from predefined feature-level context.

Enhancing Spontaneous Speech Recognition with BLSTM Features
This paper integrates the BLSTM principle into a Tandem front-end for probabilistic feature extraction for spontaneous speech recognition and shows that this concept prevails over recently published architectures for feature-level context modeling.

Localization of non-linguistic events in spontaneous speech by Non-Negative Matrix Factorization and Long Short-Term Memory
A novel tandem approach is introduced by integrating likelihood features derived from NMF into Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs) in order to dynamically localize non-linguistic events.

Probabilistic ASR feature extraction applying context-sensitive connectionist temporal classification networks
In challenging ASR scenarios involving highly spontaneous, disfluent, and noisy speech, the BN-CTC front-end leads to remarkable word accuracy improvements and prevails over a series of previously introduced BLSTM-based ASR systems.

Feature combination and stacking of recurrent and non-recurrent neural networks for LVCSR
The phoneme posterior estimates derived from an RNN lead to a significant improvement over the result of the MLPs and achieve a 5% relative word error rate (WER) improvement with far fewer parameters.

Computational Assessment of Interest in Speech—Facing the Real-Life Challenge
A fully automatic combination of brute-forced acoustic features, linguistic analysis, and non-linguistic vocalizations, exploiting cross-entity information in an early feature fusion, is introduced.

Tandem decoding of children's speech for keyword detection in a child-robot interaction scenario
The FAU Aibo Emotion Corpus is used, which contains emotionally colored spontaneous children's speech recorded in a child-robot interaction scenario, and the Tandem model prevails over a triphone-based Hidden Markov Model approach.

References

Showing 1–10 of 24 references
Tandem acoustic modeling in large-vocabulary recognition
  • D. Ellis, Rita Singh, S. Sivadas
  • Computer Science
    2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
  • 2001
It is found that, when context-independent models are used, the tandem features continue to result in large reductions in word-error rates relative to those achieved by systems using standard MFC or PLP features, but these improvements do not carry over to context-dependent models.
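The Tandem principle referenced above — feeding neural-network phoneme posteriors to a GMM-HMM as if they were acoustic features — is usually accompanied by a log transform and a decorrelation step, since diagonal-covariance Gaussians model decorrelated features better. A minimal NumPy sketch of that post-processing, with the frame count, class count, and retained dimensionality chosen arbitrarily for illustration:

```python
import numpy as np

# Tandem post-processing sketch (assumed pipeline, not the papers' exact setup):
# log-transform framewise phoneme posteriors, then PCA-decorrelate them so
# diagonal-covariance GMM-HMMs can model the result.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(41), size=500)  # 500 frames x 41 classes
logp = np.log(posteriors + 1e-10)                  # log to Gaussianize

centered = logp - logp.mean(axis=0)                # zero-mean per dimension
cov = centered.T @ centered / len(centered)        # sample covariance (41 x 41)
eigvals, eigvecs = np.linalg.eigh(cov)             # eigendecomposition
order = np.argsort(eigvals)[::-1][:24]             # keep top 24 components
tandem_feats = centered @ eigvecs[:, order]        # decorrelated tandem features
```

In practice the decorrelated features are then appended to (or used in place of) the MFCC/PLP stream consumed by the HMM recognizer.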
Robust in-car spelling recognition - a tandem BLSTM-HMM approach
A novel Tandem spelling recogniser is proposed, combining a Hidden Markov Model (HMM) with a discriminatively trained bidirectional Long Short-Term Memory (BLSTM) recurrent neural net, which makes the Tandem BLSTM-HMM robust with respect to speech signal disturbances at extremely low signal-to-noise ratios and mismatches between training and test noise conditions.

Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks
A new technique for robust keyword spotting is presented that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding and overcomes the drawbacks of generative HMM modeling.

A Tandem BLSTM-DBN Architecture for Keyword Spotting with Enhanced Context Modeling
A novel architecture for keyword spotting is presented which is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural net; it is based on a phoneme recognizer and uses a hidden garbage variable to discriminate between keywords and arbitrary speech.

Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition
In this paper, two experiments on the TIMIT speech corpus with bidirectional and unidirectional Long Short-Term Memory networks are carried out, and it is found that a hybrid BLSTM-HMM system improves on an equivalent traditional HMM system.

Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework
This article proposes a new architecture for vocabulary-independent keyword detection as needed for cognitive virtual agents such as the SEMAINE system, and evaluates the Tandem BLSTM-DBN technique on both read speech and spontaneous emotional speech to show that it significantly outperforms conventional Hidden Markov Model-based approaches in both application scenarios.

Enhanced Phone Posteriors for Improving Speech Recognition Systems
This paper proposes two approaches for hierarchically enhancing phone posteriors by integrating long acoustic context, as well as phonetic and lexical knowledge, in hidden Markov model (HMM) forward-backward recursions.

Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening
A novel technique is presented for incremental recognition of the user's emotional state, as applied in a Sensitive Artificial Listener (SAL) system designed for socially competent human-machine communication.
MMIE training of large vocabulary recognition systems