ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling

  • Ashish Shenoy, S. Bodapati, Katrin Kirchhoff
Automatic Speech Recognition (ASR) robustness toward slot entities is critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross-utterance contextual cues play an important role in disambiguating domain-specific content words from speech. In this paper, we investigate various techniques to improve contextualization, content word robustness and domain adaptation of a Transformer-XL neural language…

Cross-utterance Reranking Models with BERT and Graph Convolutional Networks for Conversational Speech Recognition
This paper seeks to represent the historical context information of an utterance as graph-structured data so as to distill cross-utterance, global word interaction relationships in ASR N-best reranking.
Remember the context! ASR slot error correction through memorization
Accurate recognition of slot values such as domain-specific words or named entities by automatic speech recognition (ASR) systems forms the core of goal-oriented dialogue systems. Although it is…


Contextual Language Model Adaptation for Conversational Agents
A DNN-based method to adapt the LM to each user-agent interaction based on generalized contextual information, by predicting an optimal, context-dependent set of LM interpolation weights, is presented.
Transformer Language Models with LSTM-Based Cross-Utterance Information Representation
  • G. Sun, C. Zhang, P. Woodland
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
The R-TLM uses hidden states in a long short-term memory (LSTM) LM to encode the cross-utterance information and was found to have better LM scores on words where recognition errors are more likely to occur.
Session-level Language Modeling for Conversational Speech
We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena…
Improving Intent Classification in an E-commerce Voice Assistant by Using Inter-Utterance Context
This work improves intent classification in Walmart’s English-based e-commerce voice assistant by using inter-utterance context: the intent of the previous user utterance is used to predict the intent of the current utterance.
Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech
A multimodal semi-supervised learning approach for punctuation prediction is explored, which learns representations from large amounts of unlabelled audio and text data; an ablation study is performed on various corpus sizes.
Long-span language modeling for speech recognition
A new architecture that incorporates an attention mechanism into LSTM to combine the benefits of recurrent and attention architectures is proposed, and speech recognition experiments using long-span language models in second-pass re-ranking are described to provide insights into the ability of such models to take advantage of context beyond the current sentence.
Training Language Models for Long-Span Cross-Sentence Evaluation
This work trains language models based on long short-term memory recurrent neural networks and Transformers using various types of training sequences and studies their robustness with respect to different evaluation modes, showing that models trained with back-propagation over sequences consisting of a concatenation of multiple sentences, with state carry-over across sequences, outperform those trained with sentence-level training.
Scalable Language Model Adaptation for Spoken Dialogue Systems
This paper proposes a solution to estimate n-gram counts directly from the hand-written grammar for training LMs and uses constrained optimization to optimize the system parameters for future use cases, while not degrading the performance on past usage.
Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset
This work introduces the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains, and presents a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots provided as input.
Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data
The MultiDoGO dataset is introduced, which is over 8 times the size of MultiWOZ, the other largest comparable dialogue dataset currently available to the public; it was collected using a Wizard-of-Oz approach wherein a crowd-sourced worker is paired with a trained annotator.