Speech-Language Pre-Training for End-to-End Spoken Language Understanding
@article{Qian2021SpeechLanguagePF,
  title   = {Speech-Language Pre-Training for End-to-End Spoken Language Understanding},
  author  = {Yao Qian and Ximo Bian and Yu Shi and Naoyuki Kanda and Leo Shen and Zhen Xiao and Michael Zeng},
  journal = {ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year    = {2021},
  pages   = {7458-7462}
}
End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder…
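To make the abstract's high-level design concrete, here is a minimal PyTorch sketch of the general idea of feeding an ASR-style speech encoder into a pre-trained language-model encoder topped by an intent classifier. The layer counts, the fusion by simple stacking, the mean pooling, and the intent head are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch: speech encoder -> LM encoder -> intent head.
# All sizes and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn

class SpeechLanguageSLU(nn.Module):
    def __init__(self, n_mels=80, d_model=768, n_intents=31):
        super().__init__()
        self.frontend = nn.Linear(n_mels, d_model)
        # Stand-in for a well-optimized E2E ASR encoder over filterbank features.
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Stand-in for a pre-trained language-model encoder; in practice its
        # weights would be initialized from a pre-trained LM and fine-tuned.
        self.lm_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.intent_head = nn.Linear(d_model, n_intents)

    def forward(self, fbank):                    # fbank: (batch, frames, n_mels)
        h = self.speech_encoder(self.frontend(fbank))
        h = self.lm_encoder(h)                   # LM encoder consumes speech-derived embeddings
        return self.intent_head(h.mean(dim=1))   # pool over time, predict intent logits

logits = SpeechLanguageSLU()(torch.randn(2, 300, 80))
print(logits.shape)  # torch.Size([2, 31])
```

In practice the speech encoder would be initialized from a well-optimized E2E ASR model and the text-side encoder from a pre-trained language model, with both fine-tuned on the SLU task.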
13 Citations
End-to-End Spoken Language Understanding using RNN-Transducer ASR
- Computer Science · ArXiv
- 2021
An end-to-end trained spoken language understanding (SLU) system is proposed that extracts transcripts, intents, and slots from an input speech utterance, is connected to a neural natural language understanding model through a neural interface, and improves both ASR and NLU metrics on public SLU datasets and on large proprietary datasets.
Speech2Slot: An End-to-End Knowledge-based Slot Filling from Speech
- Computer Science · ArXiv
- 2021
Inspired by object detection in computer vision, which detects objects within an image, this work treats slot filling (SF) as the task of slot detection from speech and proposes an end-to-end knowledge-based SF model, named Speech2Slot, that leverages knowledge to detect the boundary of a slot in the speech.
Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding
- Computer Science · ArXiv
- 2021
This work proposes a simple and robust integration method for the E2E SLU network with a novel interface, the Continuous Token Interface (CTI), and verifies that an NLU network pre-trained with masked language modeling (MLM) can utilize the noisy textual representation produced by CTI.
Do We Still Need Automatic Speech Recognition for Spoken Language Understanding?
- Computer Science · ArXiv
- 2021
It is shown that learned speech features are superior to ASR transcripts on three classification tasks, and the intrinsic robustness of wav2vec 2.0 representations to out-of-vocabulary words is highlighted as key to the better performance.
Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems
- Computer Science · ArXiv
- 2022
This work introduces a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder such that they can be directly compared and aligned with BERT-based contextual embeddings (see the sketch below).
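As a rough illustration of the cross-modal attention idea in the entry above, the following hypothetical PyTorch snippet lets text-token queries attend over speech-frame embeddings to produce one speech-derived vector per token; the dimensions, tensor names, and the cosine-similarity comparison are assumptions for illustration, not the cited paper's implementation.

```python
# Cross-modal attention sketch: text-token queries attend over speech frames,
# yielding one speech-derived embedding per token for comparison with BERT.
import torch
import torch.nn as nn

d = 768
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

bert_tokens   = torch.randn(2, 12, d)   # (batch, n_tokens, d): contextual text embeddings
speech_frames = torch.randn(2, 200, d)  # (batch, n_frames, d): speech encoder outputs

# Queries come from text tokens; keys/values come from speech frames.
token_level_speech, _ = cross_attn(bert_tokens, speech_frames, speech_frames)

# A token-wise comparison (here cosine similarity) between the aligned pair.
sim = torch.cosine_similarity(token_level_speech, bert_tokens, dim=-1)  # (2, 12)
```

A token-wise contrastive objective would then pull each token's speech-derived embedding toward its matching BERT embedding and away from non-matching ones.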
Intent Classification Using Pre-Trained Embeddings For Low Resource Languages
- Computer Science · ArXiv
- 2021
A comparative study aimed at employing a pre-trained acoustic model to perform SLU in low-resource scenarios, with a quantitative analysis of how performance scales with the number of training examples used per intent.
Intent Classification Using Pre-trained Language Agnostic Embeddings For Low Resource Languages
- Computer Science
- 2021
A comparative study aimed at employing a pre-trained language-agnostic acoustic model to perform SLU in low-resource scenarios, which improves on the state-of-the-art (SOTA) intent classification accuracy.
Exploring Teacher-Student Learning Approach for Multi-Lingual Speech-to-Intent Classification
- Computer Science · 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
This work employs a teacher-student learning approach to sufficiently extract information from an mBERT model to train a multi-lingual speech model, and demonstrates that the teacher-student approach outperforms the traditional end-to-end intent classification approach in a practical multi-lingual scenario.
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding
- Computer Science · Interspeech
- 2021
This work generates new splits that identify performance gaps of up to 10% between end-to-end systems that were within 1% of each other on the original test sets, allowing more realistic and actionable comparisons between different architectures and driving future model development.
Improved Spoken Language Representation for Intent Understanding in a Task-Oriented Dialogue System
- Computer Science · Sensors
- 2022
A novel approach is proposed that jointly uses both the recognized text obtained from the ASR model and the given labeled text to overcome the limited performance of intent classification in spoken dialogue systems whose input contains ASR errors.
References
Showing 1-10 of 28 references
Speech Model Pre-training for End-to-End Spoken Language Understanding
- Computer Science · INTERSPEECH
- 2019
A method to reduce the data requirements of end-to-end SLU is proposed, in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU; it improves performance both when the full dataset is used for training and when only a small subset is used.
Using Speech Synthesis to Train End-To-End Spoken Language Understanding Models
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work proposes a strategy to overcome the need for large amounts of recorded training speech, in which speech synthesis is used to generate a large synthetic training dataset from several artificial speakers, and confirms the effectiveness of this approach with experiments on two open-source SLU datasets.
From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding
- Computer Science · 2018 IEEE Spoken Language Technology Workshop (SLT)
- 2018
This paper formulates audio-to-semantic understanding as a sequence-to-sequence problem, and proposes and compares various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner.
Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system
- Computer Science · 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2017
This paper proposes an ASR-free, end-to-end (E2E) modeling approach to SLU for a cloud-based, modular spoken dialog system (SDS) and evaluates the effectiveness of the approach on crowdsourced data collected from non-native English speakers interacting with a conversational language learning application.
Discriminative Transfer Learning for Optimizing ASR and Semantic Labeling in Task-Oriented Spoken Dialog
- Computer Science · INTERSPEECH
- 2020
Transfer learning in a Generative Pre-trained Transformer (GPT) is exploited to jointly optimize ASR error correction and semantic labeling in terms of dialog act and slot-value for a given user’s spoken response in the context of a spoken dialog system (SDS).
Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper implements a CTC-based speech-to-intent (S2I) system that matches the performance of a state-of-the-art, traditional cascaded SLU system, and investigates two techniques to improve the S2I system, transfer learning and data augmentation, which recover 80% of the performance lost due to using limited intent-labeled speech.
Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper explores unsupervised pre-training for end-to-end SLU models by learning representations from large-scale raw audio; the learned representations preserve semantic features that benefit the downstream SLU tasks when the model weights are further fine-tuned on task-specific training data.
Learning Spoken Language Representations with Neural Lattice Language Modeling
- Computer Science · ACL
- 2020
A framework is proposed that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks, reducing the demand for speech data and improving efficiency.
Unified Language Model Pre-training for Natural Language Understanding and Generation
- Computer Science · NeurIPS
- 2019
A new Unified pre-trained Language Model (UniLM) is presented that can be fine-tuned for both natural language understanding and generation tasks and compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.
Few-shot Natural Language Generation for Task-Oriented Dialog
- Computer Science · Findings of EMNLP
- 2020
FewShotWOZ is presented, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems, and the proposed SC-GPT model significantly outperforms existing methods, as measured by various automatic metrics and human evaluations.