Incremental processing of noisy user utterances in the spoken language understanding task

Stefan Constantin, Jan Niehues, and Alexander H. Waibel. Conference on Empirical Methods in Natural Language Processing (EMNLP).
State-of-the-art neural network architectures make it possible to build spoken language understanding systems with high quality and fast processing times. One major challenge for real-world applications is the high latency caused by triggered actions with long execution times. If an action can be separated into subactions, the system's reaction time can be improved by processing the user utterance incrementally and starting subactions while the utterance is…
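The idea sketched in the abstract can be illustrated with a minimal toy example. All names here (`extract_subactions`, the keyword rules, the coffee-order domain) are hypothetical illustrations, not the paper's actual model: as partial ASR transcripts stream in, any subaction whose triggering content has already been heard is launched immediately instead of waiting for the full utterance.

```python
def extract_subactions(partial_utterance):
    """Toy 'understanding' step: map keywords to subactions (illustrative only)."""
    subactions = []
    if "coffee" in partial_utterance:
        subactions.append("grind_beans")   # can start before the utterance ends
    if "milk" in partial_utterance:
        subactions.append("froth_milk")
    return subactions

def process_stream(partial_transcripts):
    """Trigger each subaction as soon as a partial transcript supports it."""
    started = set()
    for partial in partial_transcripts:    # e.g., growing ASR hypotheses over time
        for sub in extract_subactions(partial):
            if sub not in started:
                started.add(sub)           # subaction launched incrementally here
    return started

# A coffee order arriving word by word:
stream = ["make", "make a", "make a coffee", "make a coffee with milk"]
print(sorted(process_stream(stream)))      # → ['froth_milk', 'grind_beans']
```

In this sketch, `grind_beans` starts as soon as "coffee" is heard, two partial transcripts before the utterance is complete, which is the latency saving the paper targets.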

Real-time Caller Intent Detection In Human-Human Customer Support Spoken Conversations

This work applies a method developed in the context of voice assistants to online, real-time caller intent detection in human-human spoken interactions, using a dual architecture in which two LSTMs are jointly trained: one predicting the Intent Boundary (IB) and the other predicting the intent class at the IB.

Bimodal Speech Emotion Recognition Using Pre-Trained Language Models

Speech emotion recognition is a challenging task and an important step towards more natural human-machine interaction. This work shows that pre-trained language models can be fine-tuned for text emotion recognition.

Not So Fast, Classifier – Accuracy and Entropy Reduction in Incremental Intent Classification

InCLINC, a dataset of partial and full utterances with human annotations of plausible intent labels for different portions of each utterance, is released as an upper (human) baseline for incremental intent classification, and entropy reduction is proposed as a measure of human annotators' convergence on an interpretation (i.e., an intent label).

Low-Latency Neural Speech Translation

It is shown that NMT systems can be adapted to scenarios where no task-specific training data is available, and the number of corrections displayed during incremental output construction is reduced by 45%, without a decrease in translation quality.

Multi-task learning to improve natural language understanding

This work trains on out-of-domain real data alongside in-domain synthetic data to improve natural language understanding, using an attention-based encoder-decoder model to improve the F1-score over strong baselines.

Enhancing Backchannel Prediction Using Word Embeddings

This work refines backchannel prediction by evaluating different methods of adding timed word embeddings via word2vec, showing that adding linguistic features improves performance over a prediction system that uses only acoustic features.

Toward Robust Neural Machine Translation for Noisy Input Sequences

It is shown that with a simple generative noise model, moderate gains can be achieved in translating erroneous speech transcripts, provided that type and amount of noise are properly calibrated.

Building Real-Time Speech Recognition Without CMVN

This paper proposes a feature extraction architecture which can transform unnormalized log mel features to normalized bottleneck features without using historical data and empirically shows that mean and variance normalization is not critical for training neural networks on speech data.

Learning to Buy Time: A Data-Driven Model for Avoiding Silence While Task-Related Information Cannot Yet Be Presented

It is concluded that “buying time” in a natural fashion is possible and beneficial for interaction quality, but only if sequencing constraints found in natural data are reproduced.

The 2015 KIT IWSLT speech-to-text systems for English and German

This paper describes the German and English Speech-to-Text systems for the 2015 IWSLT evaluation campaign, which focuses on the transcription of unsegmented TED talks; the combined system produces a final hypothesis with a significantly lower WER than any of the individual subsystems.

Very Deep Self-Attention Networks for End-to-End Speech Recognition

This work proposes to use self-attention via the Transformer architecture as an alternative to time-delay neural networks and shows that deep Transformer networks with high learning capacity are able to exceed performance from previous end-to-end approaches and even match the conventional hybrid systems.

Multi-Domain Joint Semantic Frame Parsing Using Bi-Directional RNN-LSTM

Experimental results show the power of a holistic multi-domain, multi-task modeling approach to estimate complete semantic frames for all user utterances addressed to a conversational system over alternative methods based on single domain/task deep learning.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.