RNN Transducer Models for Spoken Language Understanding
@article{Thomas2021RNNTM,
  title   = {RNN Transducer Models for Spoken Language Understanding},
  author  = {Samuel Thomas and Hong-Kwang Jeff Kuo and George Saon and Zolt{\'a}n T{\"u}ske and Brian Kingsbury and Gakuto Kurata and Zvi Kons and Ron Hoory},
  journal = {ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year    = {2021},
  pages   = {7493-7497}
}
We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a more restrictive case where transcripts are available but not corresponding audio. We show how RNN-T SLU models can be developed starting from pre…
9 Citations
Improving End-to-end Models for Set Prediction in Spoken Language Understanding
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
To improve E2E SLU models when the spoken order of entities is unknown, a novel data augmentation technique is proposed along with an implicit attention-based alignment method to infer the spoken order.
Seq2seq and Legacy techniques enabled Chatbot with Voice assistance
- Computer Science · 2022 IEEE 2nd Mysore Sub Section International Conference (MysuruCon)
- 2022
A voice assistant application is built that performs real-time voice recognition, processes requests according to the client's requirements, responds appropriately, and can complete tasks without requiring eye contact.
Extending RNN-T-based speech recognition systems with emotion and language classification
- Computer Science · INTERSPEECH
- 2022
This work extends an STT system to emotion classification with minimal changes, showing successful results on the IEMOCAP and MELD datasets and demonstrating state-of-the-art accuracy on the NIST-LRE-07 dataset.
Towards Reducing the Need for Speech Training Data to Build Spoken Language Understanding Systems
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
This paper proposes a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using text resources alone, and shows that these models can be further improved to perform at levels close to similar systems built on the full speech datasets.
A New Data Augmentation Method for Intent Classification Enhancement and its Application on Spoken Conversation Datasets
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
The NNSI reduces the need for manual labeling by automatically selecting highly-ambiguous samples and labeling them with high accuracy by integrating the classifier’s output from a semantically similar group of text samples.
On joint training with interfaces for spoken language understanding
- Computer Science · INTERSPEECH
- 2022
This paper leverages large pretrained ASR and NLU models connected by a text interface, jointly trains both models via a sequence loss function, and shows the overall diminishing impact of leveraging pretrained models as training data size increases.
Integrating Dialog History into End-to-End Spoken Language Understanding Systems
- Computer Science · Interspeech
- 2021
This paper investigates the importance of dialog history and how it can be effectively integrated into end-to-end SLU systems, proposing an RNN transducer (RNN-T) based SLU model that has access to its dialog history in the form of decoded transcripts and SLU labels of previous turns.
End-to-End Spoken Language Understanding using RNN-Transducer ASR
- Computer Science · ArXiv
- 2021
An end-to-end trained spoken language understanding (SLU) system is proposed that extracts transcripts, intents, and slots from an input speech utterance; the ASR component is connected to a neural natural language understanding model through a neural interface, improving both ASR and NLU metrics on public SLU datasets and on large proprietary datasets.
Multi-Task Language Modeling for Improving Speech Recognition of Rare Words
- Computer Science · 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
This paper proposes a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance, and shows that the rescoring model trained with these additional tasks outperforms the baseline rescoring models.
References
Showing 1-10 of 31 references
Advancing RNN Transducer Technology for Speech Recognition
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
A novel multiplicative integration of the encoder and prediction network vectors in the joint network (as opposed to additive) and the applicability of i-vector speaker adaptation to RNN-Ts in conjunction with data perturbation are discussed.
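The multiplicative integration mentioned above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: in an RNN-T joint network, each encoder frame output f_t is combined with each prediction-network output g_u; the classic combination is additive, while the multiplicative variant takes an elementwise product. All dimensions and random inputs below are hypothetical, chosen only to show the broadcasting pattern.

```python
import numpy as np

# Hypothetical sizes for illustration: T encoder frames,
# U prediction-network steps, H joint hidden dimension.
T, U, H = 4, 3, 8

rng = np.random.default_rng(0)
f = rng.standard_normal((T, H))  # encoder outputs f_t
g = rng.standard_normal((U, H))  # prediction network outputs g_u

# Additive joint (classic RNN-T): z[t, u] = tanh(f_t + g_u)
z_add = np.tanh(f[:, None, :] + g[None, :, :])

# Multiplicative integration:     z[t, u] = tanh(f_t * g_u)
z_mul = np.tanh(f[:, None, :] * g[None, :, :])

print(z_add.shape, z_mul.shape)  # (4, 3, 8) (4, 3, 8)
```

Both variants produce a (T, U, H) joint tensor that a final linear-plus-softmax layer would map to output token probabilities; only the combination operator differs.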
End-to-End Spoken Language Understanding Without Full Transcripts
- Computer Science · INTERSPEECH
- 2020
End-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities are developed, and it is investigated whether these E2E SLU models can be trained solely on semantic entity annotations, without word-for-word transcripts.
End-to-End Neural Transformer Based Spoken Language Understanding
- Computer Science · INTERSPEECH
- 2020
An end-to-end neural transformer-based SLU model that can predict the variable-length domain, intent, and slots vectors embedded in an audio signal with no intermediate token prediction architecture is introduced.
Large-scale Transfer Learning for Low-resource Spoken Language Understanding
- Computer Science · INTERSPEECH
- 2020
An attention-based SLU model is proposed together with three encoder enhancement strategies to overcome the data-sparsity challenge, reducing the risk of over-fitting and indirectly strengthening the underlying encoder.
Improving End-to-End Speech-to-Intent Classification with Reptile
- Computer Science · INTERSPEECH
- 2020
Though Reptile was originally proposed for model-agnostic meta learning, it is argued that it can also be used to directly learn a target task and result in better generalization than conventional gradient descent.
Rnn-Transducer with Stateless Prediction Network
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
The results suggest that the RNN-T prediction network does not function as the LM in classical ASR; instead, it merely helps the model align to the input audio, while the RNN-T encoder and joint networks capture both the acoustic and the linguistic information.
Leveraging Unpaired Text Data for Training End-To-End Speech-to-Intent Systems
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper implements a CTC-based S2I system that matches the performance of a state-of-the-art traditional cascaded SLU system, and investigates two techniques to improve the S2I system, transfer learning and data augmentation, which recover 80% of the performance lost due to using limited intent-labeled speech.
Improved End-To-End Spoken Utterance Classification with a Self-Attention Acoustic Classifier
- Physics · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
It is demonstrated that the model obtains strong performance with acoustic features alone compared to a text classifier on ASR outputs; when acoustic and lexical embeddings from these classifiers are combined, accuracy on par with human agents can be achieved.
End-to-End Architectures for ASR-Free Spoken Language Understanding
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
A set of recurrent architectures for intent classification, tailored to the recently introduced Fluent Speech Commands dataset, where intents are formed as combinations of three slots (action, object, and location), are explored.
Using Speech Synthesis to Train End-To-End Spoken Language Understanding Models
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work proposes a strategy to overcome this requirement in which speech synthesis is used to generate a large synthetic training dataset from several artificial speakers, and confirms the effectiveness of this approach with experiments on two open-source SLU datasets.