Can Transformer Models Measure Coherence In Text? Re-Thinking the Shuffle Test

Philippe Laban and Luke Dai
The Shuffle Test is the most common task to evaluate whether NLP models can measure coherence in text. Most recent work uses direct supervision on the task; we show that by simply finetuning a RoBERTa model, we can achieve a near-perfect accuracy of 97.8%, a state-of-the-art result. We argue that this outstanding performance is unlikely to lead to a good model of text coherence, and suggest that the Shuffle Test should be approached in a Zero-Shot setting: models should be evaluated without being…
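For readers unfamiliar with the task, the following is a minimal sketch of how Shuffle Test examples can be built and scored. It is not the authors' implementation. The regex sentence splitter, the pairwise "original should score higher than shuffled" framing, and the names `split_sentences`, `make_shuffle_pair`, and `shuffle_test_accuracy` are all assumptions made for illustration; `score` stands for any hypothetical function that maps a text to a coherence score.

```python
import random
import re


def split_sentences(text):
    """Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def make_shuffle_pair(text, rng):
    """Return (original_sentences, shuffled_sentences); the shuffled order always differs."""
    sentences = split_sentences(text)
    if len(set(sentences)) < 2:
        raise ValueError("need at least two distinct sentences to shuffle")
    shuffled = sentences[:]
    # Reshuffle until the order actually changes, so the pair is never identical.
    while shuffled == sentences:
        rng.shuffle(shuffled)
    return sentences, shuffled


def shuffle_test_accuracy(score, documents, seed=0):
    """Fraction of documents whose original order scores higher than a shuffled copy.

    `score` is a hypothetical callable text -> float; the pairwise comparison
    is an illustrative framing, not necessarily the paper's exact protocol.
    """
    rng = random.Random(seed)
    correct = 0
    for doc in documents:
        original, shuffled = make_shuffle_pair(doc, rng)
        if score(" ".join(original)) > score(" ".join(shuffled)):
            correct += 1
    return correct / len(documents)
```

In a Zero-Shot setting, `score` would come from a model that was never trained on shuffled text, for example a language model's likelihood of the document; in the supervised setting, it would be the output of a classifier finetuned on such pairs.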


Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
The results show that purely distributional information largely explains the success of pretraining, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.


CohEval: Benchmarking Coherence Models
A weak correlation between the model performances in the synthetic tasks and the downstream applications is demonstrated, motivating alternate evaluation methods for coherence models.
A Unified Neural Coherence Model
This paper proposes a unified coherence model that incorporates sentence grammar, inter-sentence coherence relations, and global coherence patterns into a common neural framework, and establishes a new state-of-the-art model.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Sentence Ordering and Coherence Modeling using Recurrent Neural Networks
This work proposes an end-to-end unsupervised deep learning approach based on the set-to-sequence framework to address the structure of coherent texts and shows that useful text representations can be obtained by learning to order sentences.
A Deep Reinforced Model for Abstractive Summarization
A neural network model with a novel intra-attention that attends over the input and continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL) that produces higher quality summaries.
Abstractive Summarization of Reddit Posts with Multi-level Memory Networks
This work collects the Reddit TIFU dataset, consisting of 120K posts from the online discussion forum Reddit, and proposes a novel abstractive summarization model named multi-level memory networks (MMN), equipped with multi-level memory to store the information of text from different levels of abstraction.
Reformer: The Efficient Transformer
This work replaces dot-product attention by one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of several times, making the model much more memory-efficient and much faster on long sequences.
Universal Language Model Fine-tuning for Text Classification
This work proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.