Language Model Evaluation Beyond Perplexity

  Clara Meister and Ryan Cotterell
We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework—paired with significance tests—for evaluating the fit of language models to these trends. We find that neural language models…
How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN
RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure, is introduced, showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues.
Modeling the Unigram Distribution
This work presents a novel model for estimating the unigram distribution in a language (a neuralization of Goldwater et al.).
Boosting coherence of language models
It is found that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training.
Evaluating Computational Language Models with Scaling Properties of Natural Language
Through comparison with recently proposed model-based evaluation methods, it is found that the exponent of Taylor’s law is a good indicator of model quality.
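Taylor's law here refers to the power-law relation between the mean and the variance of a word's count across segments of a text, sigma ∝ mu^alpha. A minimal stdlib-only sketch of estimating that exponent (the function name, chunking scheme, and least-squares fit are illustrative assumptions, not the cited paper's procedure):

```python
import math
from collections import Counter

def taylor_exponent(tokens, n_chunks=10):
    """Estimate alpha in sigma ~ mu**alpha, where mu and sigma are the
    mean and standard deviation of a word's count across equal-size
    chunks of the token stream (illustrative estimator)."""
    size = len(tokens) // n_chunks
    counts = [Counter(tokens[i * size:(i + 1) * size]) for i in range(n_chunks)]
    xs, ys = [], []
    for word in set(tokens):
        per_chunk = [c[word] for c in counts]
        mu = sum(per_chunk) / n_chunks
        var = sum((x - mu) ** 2 for x in per_chunk) / n_chunks
        if mu > 0 and var > 0:  # skip words with no spread (log undefined)
            xs.append(math.log(mu))
            ys.append(0.5 * math.log(var))  # log(sigma)
    # ordinary least-squares slope of log(sigma) on log(mu)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```

For an i.i.d. token stream the exponent sits near 0.5, since per-chunk counts are close to Poisson; long-range structure in real text pushes it higher, which is what makes the exponent usable as a model-quality indicator.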
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Improved Natural Language Generation via Loss Truncation
Loss truncation is proposed: a simple and scalable procedure which adaptively removes high log loss examples as a way to optimize for distinguishability and it is demonstrated that loss truncation outperforms existing baselines on distinguishability on a summarization task.
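The core of the procedure is simple to state: rank examples by their log loss and drop the worst fraction before updating. A toy sketch of that selection step, where a fixed drop fraction stands in for the adaptive threshold described in the paper:

```python
def truncated_loss(losses, drop_frac=0.1):
    """Mean loss after dropping the highest-loss examples.
    Uses a fixed drop fraction; the paper's procedure adapts this
    threshold rather than fixing it (illustrative sketch)."""
    kept = sorted(losses)[:max(1, int(len(losses) * (1 - drop_frac)))]
    return sum(kept) / len(kept)
```

Here `truncated_loss([1.0, 2.0, 3.0, 100.0], drop_frac=0.25)` averages only the three smallest losses, so a single noisy or misaligned example cannot dominate the gradient signal.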
Zipf’s word frequency law in natural language: A critical review and future directions
  • S. Piantadosi, Psychonomic Bulletin & Review, 2014
It is shown that human language has a highly complex, reliable structure in the frequency distribution over and above Zipf’s law, although prior data visualization methods have obscured this fact.
The Curious Case of Neural Text Degeneration
By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text more closely matches the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
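The "dynamic nucleus" is the smallest set of top-probability tokens whose cumulative mass exceeds a threshold p; sampling is then restricted to that set after renormalization. A minimal sketch assuming a plain list of token probabilities rather than model logits:

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample a token index from the smallest set of highest-probability
    tokens whose cumulative probability reaches p (the nucleus),
    renormalized over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # renormalize over the nucleus and sample from it
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With p = 1.0 this reduces to ordinary ancestral sampling; as p shrinks it approaches greedy decoding, which is the diversity/reliability trade-off the summary describes.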
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that relies only on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Do neural nets learn statistical laws behind natural language?
Empirical evidence is provided that a neural language model based on long short-term memory (LSTM) effectively reproduces Zipf’s law and Heaps’ law, two representative statistical properties underlying natural language.
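Both laws can be checked directly on a token stream: Zipf's law predicts a log-log rank-frequency slope near -1, and Heaps' law predicts sublinear vocabulary growth V(n) ∝ n^beta. A stdlib-only sketch of the two estimators (function names and fitting choices are illustrative, not taken from the cited paper):

```python
import math
from collections import Counter

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def zipf_slope(tokens):
    """Slope of log(frequency) vs. log(rank); near -1 under Zipf's law."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return ols_slope([math.log(r) for r in ranks],
                     [math.log(f) for f in freqs])

def heaps_exponent(tokens, n_points=20):
    """Exponent beta in V(n) ~ n**beta, where V(n) is the number of
    distinct word types among the first n tokens."""
    step = len(tokens) // n_points
    ns = [k * step for k in range(1, n_points + 1)]
    return ols_slope([math.log(n) for n in ns],
                     [math.log(len(set(tokens[:n]))) for n in ns])
```

To evaluate a model this way, one generates a long sample from it and compares these exponents against those measured on the training text.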
Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.
Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information
It is shown that ‘diagnostic classifiers’, trained to predict number from the internal states of a language model, provide a detailed understanding of how, when, and where this information is represented, and this knowledge can be used to improve their performance.
Regularizing and Optimizing LSTM Language Models
This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.
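DropConnect zeroes individual weights rather than unit activations, which is what lets it regularize the recurrent (hidden-to-hidden) matrices without modifying the RNN cell itself. A minimal list-of-lists sketch; the inverted 1/(1-p) scaling is one common convention, and names here are illustrative:

```python
import random

def drop_connect(weights, p=0.5, rng=random):
    """Zero each weight independently with probability p, scaling the
    survivors by 1/(1-p) so the expected weight value is unchanged.
    In the weight-dropped LSTM a mask like this is applied to the
    hidden-to-hidden matrices at training time (illustrative sketch)."""
    scale = 1.0 / (1.0 - p)
    return [[w * scale if rng.random() >= p else 0.0 for w in row]
            for row in weights]
```

Contrast with standard dropout, which masks whole activations: here each connection is dropped independently, so a hidden unit keeps a random subset of its incoming recurrent weights on every forward pass.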