Incremental processing of noisy user utterances in the spoken language understanding task
@inproceedings{Constantin2019IncrementalPO, title={Incremental processing of noisy user utterances in the spoken language understanding task}, author={Stefan Constantin and Jan Niehues and Alexander H. Waibel}, booktitle={Conference on Empirical Methods in Natural Language Processing}, year={2019} }
The state-of-the-art neural network architectures make it possible to create spoken language understanding systems with high quality and fast processing time. One major challenge for real-world applications is the high latency of these systems caused by triggered actions with high execution times. If an action can be separated into subactions, the reaction time of the system can be improved by incrementally processing the user utterance and starting subactions while the utterance is…
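To make the core idea concrete, here is a minimal sketch of incremental dispatching (the function and action names are hypothetical illustrations, not the paper's implementation): as partial transcripts arrive, subactions that can already be inferred are started before the utterance is complete.

```python
# Minimal sketch, assuming a hypothetical parse_subactions() SLU component:
# subactions are dispatched as soon as the growing partial utterance
# determines them, instead of waiting for the full utterance.

from typing import List

def parse_subactions(partial_utterance: str) -> List[str]:
    """Hypothetical SLU step: returns the subactions that can already be
    inferred from the (possibly noisy) partial utterance."""
    subactions = []
    if "coffee" in partial_utterance:
        subactions.append("grind_beans")      # safe to start early
    if "with milk" in partial_utterance:
        subactions.append("steam_milk")
    return subactions

def run_incrementally(transcript_chunks: List[str]) -> None:
    started = set()
    utterance = ""
    for chunk in transcript_chunks:
        utterance += chunk                    # utterance grows over time
        for action in parse_subactions(utterance):
            if action not in started:         # start each subaction once
                started.add(action)
                print(f"dispatch: {action}")  # stand-in for execution

# Subactions start before the full request has been heard.
run_incrementally(["make me a coffee ", "uh ", "with milk please"])
```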
3 Citations
Real-time Caller Intent Detection In Human-Human Customer Support Spoken Conversations
- Computer Science, ArXiv
- 2022
This work proposes to apply a method developed in the context of voice assistants to the problem of online, real-time caller intent detection in human-human spoken interactions, using a dual architecture in which two LSTMs are jointly trained: one predicting the Intent Boundary (IB) and the other predicting the intent class at the IB.
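A rough PyTorch sketch of the dual architecture this summary describes (dimensions, names, and readout strategy are assumptions, not the cited paper's code): one LSTM scores each token as a possible Intent Boundary, and a second LSTM provides the state from which the intent class is read out at the predicted IB position.

```python
import torch
import torch.nn as nn

class DualIntentModel(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=256, n_intents=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ib_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.intent_lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.ib_head = nn.Linear(hidden, 2)        # boundary / not boundary
        self.intent_head = nn.Linear(hidden, n_intents)

    def forward(self, tokens):
        x = self.embed(tokens)                     # (batch, time, emb_dim)
        ib_states, _ = self.ib_lstm(x)
        ib_logits = self.ib_head(ib_states)        # per-token IB scores
        intent_states, _ = self.intent_lstm(x)
        # Read the intent classifier out at the most likely IB position.
        ib_pos = ib_logits[..., 1].argmax(dim=1)   # (batch,)
        at_ib = intent_states[torch.arange(tokens.size(0)), ib_pos]
        intent_logits = self.intent_head(at_ib)
        return ib_logits, intent_logits            # two jointly trained heads

model = DualIntentModel()
ib_logits, intent_logits = model(torch.randint(0, 10000, (2, 12)))
```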
Bimodal Speech Emotion Recognition Using Pre-Trained Language Models
- Computer Science, ArXiv
- 2019
Speech emotion recognition is a challenging task and an important step towards more natural human-machine interaction. We show that pre-trained language models can be fine-tuned for text emotion…
Not So Fast, Classifier – Accuracy and Entropy Reduction in Incremental Intent Classification
- Computer Science, Philosophy, NLP4CONVAI
- 2021
InCLINC, a dataset of partial and full utterances with human annotations of plausible intent labels for different portions of each utterance, is released as an upper (human) baseline for incremental intent classification, and entropy reduction is proposed as a measure of human annotators' convergence on an interpretation (i.e., an intent label).
References
Showing 1-10 of 21 references
Low-Latency Neural Speech Translation
- Computer Science, INTERSPEECH
- 2018
It is shown that NMT systems can be adapted to scenarios where no task-specific training data is available, and the number of corrections displayed during incremental output construction is reduced by 45%, without a decrease in translation quality.
Multi-task learning to improve natural language understanding
- Computer Science, ArXiv
- 2018
This work trains on out-of-domain real data alongside in-domain synthetic data to improve natural language understanding, and uses an attention-based encoder-decoder model to improve the F1-score over strong baselines.
Enhancing Backchannel Prediction Using Word Embeddings
- Computer Science, INTERSPEECH
- 2017
This work refines an earlier approach by evaluating different methods of adding timed word embeddings via word2vec, and shows that adding linguistic features improves performance over a prediction system that uses only acoustic features.
Toward Robust Neural Machine Translation for Noisy Input Sequences
- Computer Science, IWSLT
- 2017
It is shown that with a simple generative noise model, moderate gains can be achieved in translating erroneous speech transcripts, provided that type and amount of noise are properly calibrated.
Building Real-Time Speech Recognition Without CMVN
- Computer Science, SPECOM
- 2018
This paper proposes a feature extraction architecture which can transform unnormalized log mel features to normalized bottleneck features without using historical data and empirically shows that mean and variance normalization is not critical for training neural networks on speech data.
Learning to Buy Time: A Data-Driven Model For Avoiding Silence While Task-Related Information Cannot Yet Be Presented
- Computer Science
- 2018
It is concluded that “buying time” in a natural fashion is possible and beneficial for interaction quality, but only if sequencing constraints found in natural data are reproduced.
The 2015 KIT IWSLT speech-to-text systems for English and German
- Computer Science, IWSLT
- 2015
This paper describes the German and English speech-to-text systems for the 2015 IWSLT evaluation campaign, which focused on the transcription of unsegmented TED talks; system combination produces a final hypothesis with a significantly lower WER than any of the individual subsystems.
Very Deep Self-Attention Networks for End-to-End Speech Recognition
- Computer Science, INTERSPEECH
- 2019
This work proposes to use self-attention via the Transformer architecture as an alternative to time-delay neural networks, and shows that deep Transformer networks with high learning capacity can exceed the performance of previous end-to-end approaches and even match conventional hybrid systems.
Multi-Domain Joint Semantic Frame Parsing Using Bi-Directional RNN-LSTM
- Computer Science, INTERSPEECH
- 2016
Experimental results show that a holistic multi-domain, multi-task modeling approach, which estimates complete semantic frames for all user utterances addressed to a conversational system, outperforms alternative methods based on single-domain/single-task deep learning.
Attention is All you Need
- Computer Science, NIPS
- 2017
This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely, and shows that it generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.