Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection

@inproceedings{Chen2022TeachingBT,
  title={Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection},
  author={Angelica Chen and Victoria Zayats and Daniel David Walker and Dirk Ryan Padfield},
  booktitle={NAACL},
  year={2022}
}
In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. While this post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation), most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays for the user. In this work we propose a streaming BERT-based sequence tagging model that… 
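
To make the problem setting concrete, the sketch below illustrates restart-incremental disfluency tagging with a simple "wait-k" emission policy, i.e., the kind of accuracy/latency trade-off the paper is concerned with. It is an illustration only, not the authors' model: tag_prefix is a hypothetical stand-in for a BERT-based sequence tagger, and the wait_k parameter is invented for this example.

# Sketch: restart-incremental disfluency tagging with a wait-k emission policy.
# tag_prefix is a hypothetical placeholder for a BERT-style sequence tagger.

from typing import List, Tuple

def tag_prefix(tokens: List[str]) -> List[str]:
    # Hypothetical tagger: mark each token in the prefix as fluent "O" or
    # disfluent "D". A real system would re-encode the prefix with BERT here.
    return ["D" if t in {"uh", "um", "I-"} else "O" for t in tokens]

def stream_tag(tokens: List[str], wait_k: int = 2) -> List[Tuple[str, str]]:
    # Re-tag each growing prefix as words arrive (restart-incrementality),
    # but only commit a token's label once wait_k further words have been
    # seen. Larger wait_k gives more right context (usually more accurate,
    # more stable labels) at the cost of higher output latency.
    committed: List[Tuple[str, str]] = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]           # words received so far
        labels = tag_prefix(prefix)   # full re-decode of the prefix
        while len(committed) < t - wait_k:
            i = len(committed)
            committed.append((prefix[i], labels[i]))
    # Flush the remaining tail once the utterance is complete.
    final_labels = tag_prefix(tokens)
    for i in range(len(committed), len(tokens)):
        committed.append((tokens[i], final_labels[i]))
    return committed

if __name__ == "__main__":
    utterance = "I- uh I want a flight to Boston".split()
    for word, label in stream_tag(utterance, wait_k=2):
        print(word, label)

With wait_k = 0 every label is emitted as soon as its word arrives and may later turn out wrong; increasing wait_k lets the tagger see right context before committing, which is the latency/accuracy balance a streaming model must control.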


References

Showing 1-10 of 31 references

Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection

TLDR
This paper proposes a Controllable Time-Delay Transformer model that jointly performs punctuation prediction and disfluency detection in real time, and supports freezing partial outputs with a controllable time delay to meet the real-time constraints of partial decoding required by subsequent applications.

Best of Both Worlds: Making High Accuracy Non-incremental Transformer-based Disfluency Detection Incremental

TLDR
This work introduces methods for word-by-word, left-to-right incremental processing to Transformers such as BERT (models without an intrinsic sense of linear order), and shows that the incrementalised Transformers maintain most of their high non-incremental performance while operating strictly incrementally.

Multi-Task Learning for Domain-General Spoken Disfluency Detection in Dialogue Systems

TLDR
This paper presents a multi-task LSTM-based model for incremental detection of disfluency structure, which can be hooked up to any component for incremental interpretation, or else simply used to "clean up" the current utterance as it is being produced.

Recurrent neural networks for incremental disfluency detection

TLDR
This work frames incremental disfluency detection as a word-by-word tagging task and tests the performance of Recurrent Neural Networks, showing very good incremental properties with low latency and very good output stability.

Incremental Processing in the Age of Non-Incremental Encoders: An Empirical Assessment of Bidirectional Models for Incremental NLU

TLDR
This work investigates how bidirectional encoders behave under incremental interfaces, when partial output must be provided based on partial input seen up to a certain time step, which may happen in interactive systems.

Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU

TLDR
The results show that the recurrent LT model has better incremental performance and faster inference speed compared to the standard Transformer and LT with restart-incrementality, at the cost of part of the non-incremental (full sequence) quality.

Disfluency Detection Using a Bidirectional LSTM

TLDR
A new approach for disfluency detection using a Bidirectional Long Short-Term Memory neural network (BLSTM), which takes as input pattern-match features developed to reduce sensitivity to vocabulary size in training, leading to improved performance over the word sequence alone.

Detecting Speech Repairs Incrementally Using a Noisy Channel Approach

TLDR
This work proposes a novel approach to evaluation, which evaluates performance in detecting and correcting disfluencies incrementally, rather than only assessing performance once the processing of an utterance is complete.

Segmentation and disfluency removal for conversational speech translation

TLDR
This paper proposes a new approach that performs simple-disfluency removal, followed by segmentation, and then complex-disfluency removal, showing a significant gain in translation performance of up to 3 BLEU points with only 6 seconds of lookahead latency, using state-of-the-art machine translation and speech recognition systems.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.