RNN Transducer Models for Spoken Language Understanding

  • Samuel Thomas, Hong-Kwang Jeff Kuo, George Saon, Zoltán Tüske, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory
  • Published 8 April 2021
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a more restrictive case where transcripts are available but not corresponding audio. We show how RNN-T SLU models can be developed starting from pre… 

Improving End-to-end Models for Set Prediction in Spoken Language Understanding

To improve E2E SLU models when the spoken order of entities is unknown, a novel data augmentation technique and an implicit attention-based alignment method for inferring the spoken order are proposed.

Seq2seq and Legacy techniques enabled Chatbot with Voice assistance

A voice assistant application is built that performs real-time voice recognition, processes requests according to the client's requirements, responds appropriately, and can complete tasks without requiring eye contact.

Extending RNN-T-based speech recognition systems with emotion and language classification

This work extends the STT system to emotion classification with minimal changes, shows successful results on the IEMOCAP and MELD datasets, and demonstrates state-of-the-art accuracy on the NIST-LRE-07 dataset.

Towards Reducing the Need for Speech Training Data to Build Spoken Language Understanding Systems

This paper proposes a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources and shows that these models can be further improved to perform at levels close to similar systems built on the full speech datasets.

A New Data Augmentation Method for Intent Classification Enhancement and its Application on Spoken Conversation Datasets

The NNSI reduces the need for manual labeling by automatically selecting highly-ambiguous samples and labeling them with high accuracy by integrating the classifier’s output from a semantically similar group of text samples.

On joint training with interfaces for spoken language understanding

This paper leverages large pretrained ASR and NLU models that are connected by a text interface, jointly trains both models via a sequence loss function, and shows that the benefit of leveraging pretrained models diminishes as the training data size increases.

Integrating Dialog History into End-to-End Spoken Language Understanding Systems

This paper investigates the importance of dialog history and how it can be effectively integrated into end-to-end SLU systems, and proposes an RNN transducer (RNN-T) based SLU model that has access to its dialog history in the form of decoded transcripts and SLU labels of previous turns.

End-to-End Spoken Language Understanding using RNN-Transducer ASR

An end-to-end trained spoken language understanding (SLU) system is proposed that extracts transcripts, intents, and slots from an input speech utterance. ASR is connected to a neural natural language understanding model through a neural interface, and the system improves both ASR and NLU metrics on public SLU datasets and on large proprietary datasets.

Multi-Task Language Modeling for Improving Speech Recognition of Rare Words

  • C. Yang, Linda Liu, I. Bulyko
  • Computer Science
  • 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2021
This paper proposes a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance, and shows that the rescoring model trained with these additional tasks outperforms the baseline rescoring models.



Advancing RNN Transducer Technology for Speech Recognition

A novel multiplicative integration of the encoder and prediction network vectors in the joint network (as opposed to additive) and the applicability of i-vector speaker adaptation to RNN-Ts in conjunction with data perturbation are discussed.
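
As a rough illustration of the additive versus multiplicative distinction in the summary above, the sketch below combines an encoder vector and a prediction-network vector both ways. The function names, the tanh nonlinearity, and the plain-list weight matrices are assumptions made for this example and are not taken from the paper.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def joint_additive(enc, pred, W_enc, W_pred, bias):
    """Hypothetical additive joint: tanh(W_enc*enc + W_pred*pred + bias)."""
    a = matvec(W_enc, enc)
    p = matvec(W_pred, pred)
    return [math.tanh(ai + pi + bi) for ai, pi, bi in zip(a, p, bias)]

def joint_multiplicative(enc, pred, W_enc, W_pred, bias):
    """Hypothetical multiplicative joint: tanh((W_enc*enc) * (W_pred*pred) + bias),
    where the product is taken element by element."""
    a = matvec(W_enc, enc)
    p = matvec(W_pred, pred)
    return [math.tanh(ai * pi + bi) for ai, pi, bi in zip(a, p, bias)]

# Toy inputs: identity projections and zero bias (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
enc_vec, pred_vec, zero_bias = [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]
additive_out = joint_additive(enc_vec, pred_vec, identity, identity, zero_bias)
multiplicative_out = joint_multiplicative(enc_vec, pred_vec, identity, identity, zero_bias)
```

In the multiplicative form each encoder dimension gates the matching prediction-network dimension, whereas the additive form simply sums the two contributions before the nonlinearity.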

End-to-End Spoken Language Understanding Without Full Transcripts

End-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities are developed, and whether these E2E SLU models can be trained solely on semantic entity annotations, without word-for-word transcripts, is investigated.

End-to-End Neural Transformer Based Spoken Language Understanding

An end-to-end neural transformer-based SLU model that can predict the variable-length domain, intent, and slots vectors embedded in an audio signal with no intermediate token prediction architecture is introduced.

Large-scale Transfer Learning for Low-resource Spoken Language Understanding

An attention-based SLU model is proposed together with three encoder enhancement strategies to overcome the data sparsity challenge; these strategies reduce the risk of over-fitting and indirectly augment the ability of the underlying encoder.

Improving End-to-End Speech-to-Intent Classification with Reptile

Though Reptile was originally proposed for model-agnostic meta learning, it is argued that it can also be used to directly learn a target task and result in better generalization than conventional gradient descent.
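
For readers unfamiliar with Reptile, a minimal sketch of its generic update rule follows: run a few SGD steps on a task, then move the initial parameters a fraction of the way toward the adapted ones. The quadratic toy loss, the function names, and the step sizes are hypothetical and do not reflect the speech-to-intent setup of the paper.

```python
def sgd_steps(theta, target, lr, k):
    """Run k plain SGD steps on the toy loss 0.5 * (theta - target) ** 2."""
    for _ in range(k):
        grad = theta - target          # derivative of the toy loss
        theta = theta - lr * grad
    return theta

def reptile_step(theta, target, inner_lr=0.5, k=2, outer_lr=0.5):
    """One Reptile update: interpolate from theta toward the SGD-adapted value."""
    adapted = sgd_steps(theta, target, inner_lr, k)
    return theta + outer_lr * (adapted - theta)

# Starting at 0.0 with a toy task whose optimum is 1.0 (illustrative values only).
theta_after = reptile_step(0.0, 1.0)
```

With `outer_lr` set to 1.0 the update reduces to ordinary SGD on the task, which is one way to see how the method can be applied directly to a single target task.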

Rnn-Transducer with Stateless Prediction Network

The results suggest that the RNN-T prediction network does not function as the LM in classical ASR; instead, it merely helps the model align to the input audio, while the RNN-T encoder and joint networks capture both the acoustic and the linguistic information.
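
One way to picture a stateless prediction network is as a lookup that depends only on the most recent output label rather than on the full label history. The sketch below assumes that interpretation; the embedding table, the blank id, and the function name are hypothetical and are not drawn from the paper.

```python
def stateless_prediction(history, embedding, blank_id=0):
    """Return a prediction vector that depends only on the last emitted label.

    history:   list of previously emitted label ids (may be empty)
    embedding: list of vectors, indexed by label id (hypothetical table)
    """
    last_label = history[-1] if history else blank_id
    return embedding[last_label]

# Toy embedding table with four labels (illustrative values only).
toy_embedding = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
out_a = stateless_prediction([1, 2], toy_embedding)
out_b = stateless_prediction([3, 2], toy_embedding)
```

Because both histories end in the same label, `out_a` and `out_b` are identical, which is the sense in which such a network carries no recurrent state.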

Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems

  • Yinghui Huang, H. Kuo, M. Picheny
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper implements a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system, and investigates two techniques to improve the S2I system, transfer learning and data augmentation, which recover 80% of the performance lost due to using limited intent-labeled speech.

Improved End-To-End Spoken Utterance Classification with a Self-Attention Acoustic Classifier

It is demonstrated that the model can obtain strong performance with acoustic features alone, compared to a text classifier on ASR outputs. When acoustic and lexical embeddings from these classifiers are combined, accuracy on par with human agents can be achieved.

End-to-End Architectures for ASR-Free Spoken Language Understanding

A set of recurrent architectures for intent classification, tailored to the recently introduced Fluent Speech Commands dataset, where intents are formed as combinations of three slots (action, object, and location), are explored.

Using Speech Synthesis to Train End-To-End Spoken Language Understanding Models

This work proposes a strategy to overcome this requirement in which speech synthesis is used to generate a large synthetic training dataset from several artificial speakers, and confirms the effectiveness of this approach with experiments on two open-source SLU datasets.