Speech-Language Pre-Training for End-to-End Spoken Language Understanding

@article{Qian2021SpeechLanguagePF,
  title={Speech-Language Pre-Training for End-to-End Spoken Language Understanding},
  author={Yao Qian and Ximo Bian and Yu Shi and Naoyuki Kanda and Leo Shen and Zhen Xiao and Michael Zeng},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={7458--7462}
}
  • Published 11 February 2021
End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder… 
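The pipeline the abstract describes (a speech encoder feeding a pre-trained language-model encoder directly, with no intermediate transcript) can be caricatured as follows. This is an illustrative sketch only: the encoders here are random projections standing in for the paper's actual pre-trained components, and all names and dimensions are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def speech_encoder(frames):
    # Stand-in for a well-optimized E2E ASR encoder: maps acoustic
    # frames (T x F) to continuous hidden states (T x D).
    W = rng.standard_normal((frames.shape[1], 64)) / np.sqrt(frames.shape[1])
    return np.tanh(frames @ W)

def language_encoder(hidden):
    # Stand-in for a pre-trained language-model encoder that consumes
    # the speech encoder's continuous outputs instead of token IDs,
    # which is what lets the system skip an explicit ASR transcript.
    W = rng.standard_normal((hidden.shape[1], 64)) / np.sqrt(hidden.shape[1])
    return np.tanh(hidden @ W)

def classify_intent(states, n_intents=5):
    # Mean-pool the joint representation and score intent classes.
    pooled = states.mean(axis=0)
    W = rng.standard_normal((states.shape[1], n_intents)) / np.sqrt(states.shape[1])
    return int(np.argmax(pooled @ W))

frames = rng.standard_normal((100, 40))  # 100 frames of 40-dim acoustic features
intent = classify_intent(language_encoder(speech_encoder(frames)))
```

In a real system both encoders would be pre-trained separately (on ASR data and text, respectively) and then fine-tuned jointly on the limited paired speech-semantics data.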

Citations

End-to-End Spoken Language Understanding using RNN-Transducer ASR
TLDR
An end-to-end trained spoken language understanding (SLU) system is proposed that extracts transcripts, intents, and slots from an input speech utterance; it is connected to a neural natural language understanding model through a neural interface and improves both ASR and NLU metrics on public SLU datasets and large proprietary datasets.
Speech2Slot: An End-to-End Knowledge-based Slot Filling from Speech
TLDR
Inspired by object detection in computer vision, which detects objects in an image, this work frames slot filling (SF) as the task of slot detection from speech and proposes an end-to-end knowledge-based SF model, named Speech-to-Slot (Speech2Slot), that leverages knowledge to detect the boundaries of slots in speech.
Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding
TLDR
This work proposes a simple and robust integration method for the E2E SLU network with a novel Interface, Continuous Token Interface (CTI), and verifies that the NLU network, pre-trained with Masked Language Model (MLM), can utilize a noisy textual representation of CTI.
Do We Still Need Automatic Speech Recognition for Spoken Language Understanding?
TLDR
It is shown that learned speech features are superior to ASR transcripts on three classification tasks and highlighted the intrinsic robustness of wav2vec 2.0 representations to out-of-vocabulary words as key to better performance.
Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems
TLDR
This work introduces a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder such that these can be directly compared and aligned with BERT-based contextual embeddings.
Intent Classification Using Pre-Trained Embeddings For Low Resource Languages
TLDR
A comparative study aimed at employing a pre-trained acoustic model to perform SLU in low resource scenarios and presents a quantitative analysis of how the performance scales with the number of training examples used per intent.
Intent Classification Using Pre-trained Language Agnostic Embeddings For Low Resource Languages
TLDR
A comparative study aimed at employing a pre-trained language agnostic acoustic model to perform SLU in low resource scenarios and improves on the state-of-the-art (SOTA) intent classification accuracy.
Exploring Teacher-Student Learning Approach for Multi-Lingual Speech-to-Intent Classification
TLDR
This work employs a teacher-student learning approach to sufficiently extract information from an mBERT model to train a multi-lingual speech model and demonstrates that the teacher-student learning approach obtains an improved performance over the traditional end-to-end intent classification approach in a practical multi-lingual scenario.
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding
TLDR
This work generates new splits that identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets, to allow more realistic and actionable comparisons between different architectures, driving future model development.
Improved Spoken Language Representation for Intent Understanding in a Task-Oriented Dialogue System
TLDR
A novel approach is proposed that jointly uses recognized text obtained from the ASR model and given labeled text to overcome the limited intent-classification performance of spoken dialogue systems affected by ASR errors.

References

Showing 1–10 of 28 references
Speech Model Pre-training for End-to-End Spoken Language Understanding
TLDR
A method is proposed to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU; it improves performance both when the full dataset is used for training and when only a small subset is used.
Using Speech Synthesis to Train End-To-End Spoken Language Understanding Models
TLDR
This work proposes a strategy to overcome this requirement in which speech synthesis is used to generate a large synthetic training dataset from several artificial speakers, and confirms the effectiveness of this approach with experiments on two open-source SLU datasets.
From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding
TLDR
This paper formulates audio-to-semantics understanding as a sequence-to-sequence problem and proposes and compares various encoder-decoder-based approaches that optimize both modules jointly, in an end-to-end manner.
Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system
TLDR
This paper proposes an ASR-free, end-to-end (E2E) modeling approach to SLU for a cloud-based, modular spoken dialog system (SDS) and evaluates the effectiveness of the approach on crowdsourced data collected from non-native English speakers interacting with a conversational language learning application.
Discriminative Transfer Learning for Optimizing ASR and Semantic Labeling in Task-Oriented Spoken Dialog
TLDR
Transfer learning in a Generative pre-trained Transformer (GPT) is exploited to jointly optimize ASR error correction and semantic labeling in terms of dialog act and slot-value for a given user’s spoken response in the context of a spoken dialog system (SDS).
Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems
  • Yinghui Huang, H. Kuo, M. Picheny
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
This paper implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system, and investigated two techniques to improve the S2I system, transfer learning and data augmentation, which recover 80% of the performance lost due to using limited intent-labeled speech.
Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding
TLDR
This paper explores unsupervised pre-training for End-to-end SLU models by learning representations from large-scale raw audios and preserves semantic features which benefit the downstream SLU tasks as the learned model weights are further fine-tuned on the task specific training data.
Learning Spoken Language Representations with Neural Lattice Language Modeling
TLDR
A framework is proposed that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks, reducing the demand for speech data and improving efficiency.
Unified Language Model Pre-training for Natural Language Understanding and Generation
TLDR
A new Unified pre-trained Language Model (UniLM) is presented that can be fine-tuned for both natural language understanding and generation tasks, and that compares favorably with BERT on the GLUE benchmark and the SQuAD 2.0 and CoQA question answering tasks.
Few-shot Natural Language Generation for Task-Oriented Dialog
TLDR
FewshotWOZ is presented, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems, and the proposed SC-GPT model significantly outperforms existing methods, measured by various automatic metrics and human evaluations.