End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting

Thierry Desot, François Portet, Michel Vacher

Toward Low-Cost End-to-End Spoken Language Understanding

It is shown that, using SSL models, it is possible to reduce the learning cost while maintaining state-of-the-art performance, and an extensive analysis is proposed in which model cost is measured in terms of training time and electric energy consumption.

Taxonomic Classification of IoT Smart Home Voice Control

A taxonomy of the voice control technologies present in commercial smart home systems is presented, and open-source libraries and devices that could support a cloud-free voice assistant are discussed.


Image recognition is the most widely used mechanism, appearing in all sectors, while speech generation and machine learning are the least used.

Corpus Generation for Voice Command in Smart Home and the Effect of Speech Synthesis on End-to-End SLU

This work presents the automatic generation of a synthetic, semantically annotated corpus of French smart-home commands, used to train pipeline and End-to-End (E2E) SLU models that jointly perform ASR and NLU.
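The corpus generation described above can be illustrated with a template-expansion sketch. This is a hypothetical example, not the paper's actual grammar: the French templates and slot labels below are illustrative, but they show how each generated utterance comes out semantically annotated "for free".

```python
import itertools

# Hypothetical template grammar for French smart-home voice commands.
# Each surface form is paired with its semantic label, so every expanded
# sentence is annotated automatically.
ACTIONS = {"allume": "turn_on", "éteins": "turn_off"}
DEVICES = {"la lumière": "light", "la radio": "radio"}
ROOMS = {"dans la cuisine": "kitchen", "dans le salon": "living_room"}

def generate_corpus():
    """Expand the templates into (utterance, intent, slots) training triples."""
    corpus = []
    for (act, intent), (dev, device), (loc, room) in itertools.product(
            ACTIONS.items(), DEVICES.items(), ROOMS.items()):
        utterance = f"{act} {dev} {loc}"
        slots = {"device": device, "room": room}
        corpus.append((utterance, intent, slots))
    return corpus

corpus = generate_corpus()
print(len(corpus))   # 2 actions x 2 devices x 2 rooms = 8 annotated commands
print(corpus[0])
```

In the paper's setting, each generated utterance would additionally be passed through a speech synthesizer to obtain the audio side of the training pairs.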

SLU for Voice Command in Smart Home: Comparison of Pipeline and End-to-End Approaches

Results show that the E2E approach can reach performance similar to a state-of-the-art pipeline SLU despite a higher WER than the pipeline approach, and can benefit from artificially generated data to exhibit lower Concept Error Rates than the pipeline baseline for slot recognition.
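The Concept Error Rate mentioned above is conventionally computed like a word error rate, but over sequences of semantic concepts rather than words. A minimal sketch, assuming the standard Levenshtein formulation (the concept strings used in the demo are illustrative):

```python
def concept_error_rate(reference, hypothesis):
    """Edit distance over concept sequences, normalised by reference length:
    (substitutions + deletions + insertions) / N."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n] / m if m else 0.0

ref = ["intent=turn_on", "device=light", "room=kitchen"]
hyp = ["intent=turn_on", "device=radio", "room=kitchen"]
print(concept_error_rate(ref, hyp))  # one substitution out of three concepts
```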

Towards End-to-end Spoken Language Understanding

This study showed that the trained model can achieve reasonably good results and demonstrated that the model can capture semantic attention directly from the audio features.

Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system

This paper proposes an ASR-free, end-to-end (E2E) modeling approach to SLU for a cloud-based, modular spoken dialog system (SDS) and evaluates the effectiveness of the approach on crowdsourced data collected from non-native English speakers interacting with a conversational language learning application.

Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces

This work presents models that extract utterance intent directly from speech without intermediate text output, and contrasts them with a jointly trained end-to-end SLU model, consisting of ASR and NLU subsystems connected by a neural-network-based interface instead of text, which produces transcripts as well as the NLU interpretation.

Using Speech Synthesis to Train End-To-End Spoken Language Understanding Models

This work proposes a strategy to overcome the need for large amounts of annotated speech training data, in which speech synthesis is used to generate a large synthetic training dataset from several artificial speakers, and confirms the effectiveness of this approach with experiments on two open-source SLU datasets.

Towards End-to-End spoken intent recognition in smart home

Experiments on a corpus of voice commands acquired in a real smart home reveal that the state-of-the-art pipeline baseline is still superior to the E2E approach; however, it is shown that artificial data generation techniques can significantly improve the E2E model, bringing it to competitive performance.

Learning Natural Language Understanding Systems from Unaligned Labels for Voice Command in Smart Homes

This paper proposes to use a sequence-to-sequence neural architecture to train NLU models which do not need aligned data and can jointly learn the intent, slot-label and slot-value prediction tasks.
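One common way a sequence-to-sequence model learns intent and slots jointly from unaligned labels is to serialize the annotation into a flat target token sequence for the decoder. A minimal sketch of that serialization, assuming illustrative tag tokens (not the paper's actual scheme):

```python
def serialize_annotation(intent, slots):
    """Flatten an unaligned semantic annotation (intent plus slot-label/value
    pairs) into a target token sequence a seq2seq decoder can be trained to emit."""
    tokens = ["<intent>", intent]
    for label, value in slots.items():
        tokens += ["<slot>", label, "=", value]
    return tokens

def deserialize_annotation(tokens):
    """Invert serialize_annotation back into (intent, slots)."""
    intent, slots = None, {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "<intent>":
            intent = tokens[i + 1]
            i += 2
        elif tokens[i] == "<slot>":
            slots[tokens[i + 1]] = tokens[i + 3]  # tokens[i + 2] is the "=" token
            i += 4
        else:
            i += 1
    return intent, slots

target = serialize_annotation("turn_on", {"device": "light", "room": "kitchen"})
print(target)
```

Because the target sequence carries no word-level alignment to the input utterance, training pairs of this form need no aligned data, which is the property the paper exploits.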

Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning

A novel training method is proposed that enables pretrained contextual embeddings to process acoustic features; it is based on a cross-modal teacher-student framework spanning the speech and text modalities that aligns the acoustic and semantic latent spaces.