Corpus ID: 236905982

Cross-utterance Reranking Models with BERT and Graph Convolutional Networks for Conversational Speech Recognition

Shih-Hsuan Chiu, Tien-Hong Lo, Fu-An Chao, Berlin Chen
How to effectively incorporate cross-utterance information cues into a neural language model (LM) has emerged as one of the intriguing issues for automatic speech recognition (ASR). Existing efforts to improve the contextualization of an LM typically treat previous utterances as a sequence of additional input and may fail to capture complex global structural dependencies among these utterances. In view of this, this paper seeks to represent the historical context information of an…

Innovative Bert-Based Reranking Language Models for Speech Recognition
This paper presents a novel instantiation of BERT-based contextualized language models (LMs) for reranking N-best hypotheses produced by automatic speech recognition (ASR), and explores capitalizing on task-specific global topic information in an unsupervised manner to assist PBERT in N-best hypothesis reranking.
ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling
This paper investigates various techniques to improve contextualization, content word robustness and domain adaptation of a Transformer-XL neural language model (NLM) to rescore ASR N-best hypotheses and proposes a multi-task model that can jointly perform content word detection and language modeling tasks.
Session-level Language Modeling for Conversational Speech
We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena
Training Language Models for Long-Span Cross-Sentence Evaluation
This work trains language models based on long short-term memory recurrent neural networks and Transformers using various types of training sequences and studies their robustness with respect to different evaluation modes, showing that models trained with back-propagation over sequences formed by concatenating multiple sentences, with state carry-over across sequences, outperform those trained at the sentence level.
Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
This work proposes to use text-based external word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding significant improvement in word error rate with better conversational-context representation.
Bidirectional recurrent neural network language models for automatic speech recognition
It is found that bidirectional RNNs significantly outperform unidirectional RNNs, but bidirectional LSTMs do not provide any further gain over their unidirectional counterparts.
Achieving Human Parity in Conversational Speech Recognition
The human error rate on the widely used NIST 2000 test set is measured, and the latest automated speech recognition system is found to have reached human parity, establishing a new state of the art that edges past the human benchmark.
Speech recognition in a dialog system: from conventional to deep processing
The aim of this paper is to illustrate an overview of the automatic speech recognition (ASR) module in a spoken dialog system and how it has evolved from the conventional GMM-HMM (Gaussian mixture
LSTM Neural Networks for Language Modeling
This work analyzes the Long Short-Term Memory neural network architecture on an English and a large French language modeling task and gains considerable improvements in WER on top of a state-of-the-art speech recognition system.
Language Modeling with Deep Transformers
The analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of positional information and it is found that removing the positional encoding even slightly improves the performance of these models.