Language Model Evaluation Beyond Perplexity

@inproceedings{Meister2021LanguageME,
  title={Language Model Evaluation Beyond Perplexity},
  author={Clara Meister and Ryan Cotterell},
  booktitle={ACL/IJCNLP},
  year={2021}
}
We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework—paired with significance tests—for evaluating the fit of language models to these trends. We find that neural language models…
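As a rough illustration of what such a statistical-tendency comparison could look like (a minimal sketch, not the paper's actual framework; the toy corpora, the choice of statistic, and the use of a two-sample Kolmogorov-Smirnov test are all assumptions):

from scipy.stats import ks_2samp

# Placeholder corpora; in practice these would be held-out human text and
# text sampled from the language model under evaluation.
human_sents = ["the cat sat on the mat", "language models are trained on text"]
model_sents = ["the the cat mat sat on", "models language text on are trained"]

def sentence_lengths(sents):
    # one simple statistic of natural text: sentence length in tokens
    return [len(s.split()) for s in sents]

stat, p_value = ks_2samp(sentence_lengths(human_sents),
                         sentence_lengths(model_sents))
print(f"KS statistic = {stat:.3f}, p = {p_value:.3f}")
# A small p-value would indicate that the generated text's length distribution
# differs significantly from the human one.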
Citations

How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN
TLDR
RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure, is introduced, showing that GPT-2’s novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues.
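A hedged sketch of the kind of n-gram novelty check this line of analysis relies on (the toy corpora and whitespace tokenization are placeholder assumptions, not the RAVEN setup):

def ngrams(tokens, n):
    # set of all n-grams in a token sequence
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

train_tokens = "the quick brown fox jumps over the lazy dog".split()
gen_tokens = "the quick red fox jumps over a sleepy dog".split()

for n in (2, 3, 4):
    gen = ngrams(gen_tokens, n)
    novel = gen - ngrams(train_tokens, n)
    print(f"{n}-grams: {len(novel)}/{len(gen)} novel "
          f"({100 * len(novel) / len(gen):.0f}%)")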
Modeling the Unigram Distribution
TLDR
This work presents a novel model for estimating the unigram distribution in a language (a neuralization of the model of Goldwater et al.).
Boosting coherence of language models
TLDR
It is found that coherence boosting with state-of-the-art models yields performance gains on various zero-shot NLP tasks with no additional training.

References

Showing 1-10 of 52 references
Evaluating Computational Language Models with Scaling Properties of Natural Language
TLDR
Through comparison with recently proposed model-based evaluation methods, it is found that the exponent of Taylor’s law is a good indicator of model quality.
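For intuition, a rough sketch of estimating a Taylor-law exponent: measure each word's mean and standard deviation of counts across fixed-size text segments and fit a log-log slope (the window size and synthetic corpus below are placeholder assumptions):

import numpy as np
from collections import Counter

def taylor_exponent(tokens, window=500):
    # count each word within consecutive fixed-size windows
    segments = [Counter(tokens[i:i + window]) for i in range(0, len(tokens), window)]
    means, stds = [], []
    for word in set(tokens):
        counts = np.array([seg[word] for seg in segments], dtype=float)
        if counts.mean() > 0 and counts.std() > 0:
            means.append(counts.mean())
            stds.append(counts.std())
    # Taylor's law: std ~ c * mean^alpha, so fit a line in log-log space
    alpha, _ = np.polyfit(np.log(means), np.log(stds), 1)
    return alpha

rng = np.random.default_rng(0)
tokens = rng.choice(["the", "cat", "sat", "on", "mat", "dog"], size=20000,
                    p=[0.4, 0.2, 0.15, 0.1, 0.1, 0.05]).tolist()
print(f"estimated Taylor exponent: {taylor_exponent(tokens):.2f}")
# i.i.d. text gives roughly 0.5; natural language is reported to score higher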
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Improved Natural Language Generation via Loss Truncation
TLDR
Loss truncation is proposed: a simple and scalable procedure which adaptively removes high-log-loss examples as a way to optimize for distinguishability, and it is demonstrated that loss truncation outperforms existing baselines on distinguishability on a summarization task.
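A minimal sketch of the loss-truncation idea, assuming PyTorch and a per-example loss vector; the drop fraction and tensors are illustrative, not the authors' implementation:

import torch

def truncated_loss(per_example_loss: torch.Tensor, drop_frac: float = 0.1):
    # keep only the (1 - drop_frac) lowest-loss examples in the batch
    k = max(1, int(per_example_loss.numel() * (1.0 - drop_frac)))
    kept, _ = torch.topk(per_example_loss, k, largest=False)
    return kept.mean()

losses = torch.tensor([0.8, 1.1, 0.9, 7.5])    # one high-loss outlier
print(truncated_loss(losses, drop_frac=0.25))  # averages the three kept examples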
Zipf’s word frequency law in natural language: A critical review and future directions
  • S. Piantadosi, Psychonomic Bulletin & Review, 2014
TLDR
It is shown that human language has a highly complex, reliable structure in the frequency distribution over and above Zipf’s law, although prior data visualization methods have obscured this fact.
The Curious Case of Neural Text Degeneration
TLDR
By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better matches the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
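A minimal nucleus (top-p) sampling sketch over a toy next-token distribution, using NumPy; the vocabulary, probabilities, and p value are placeholders (real use would operate on a model's softmax output):

import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]          # token indices, most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renorm)     # sample only from the nucleus

vocab = np.array(["the", "a", "dog", "xylophone"])
probs = np.array([0.5, 0.3, 0.15, 0.05])
print(vocab[nucleus_sample(probs, p=0.9)])   # the 0.05 tail token is cut off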
Cross-lingual Language Model Pretraining
TLDR
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that relies only on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Do neural nets learn statistical laws behind natural language?
TLDR
Empirical evidence is provided that a neural language model based on long short-term memory (LSTM) effectively reproduces Zipf’s law and Heaps’ law, two representative statistical properties underlying natural language.
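For reference, a quick sketch of how those two statistics can be measured on a token stream (the synthetic Zipfian corpus and window step are placeholder assumptions):

import numpy as np
from collections import Counter

def zipf_slope(tokens):
    # slope of log frequency vs. log rank; close to -1 for Zipf-like text
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

def heaps_slope(tokens, step=1000):
    # slope of log vocabulary size vs. log corpus size (Heaps' law exponent)
    sizes, vocab_sizes, seen = [], [], set()
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            sizes.append(i)
            vocab_sizes.append(len(seen))
    slope, _ = np.polyfit(np.log(sizes), np.log(vocab_sizes), 1)
    return slope

rng = np.random.default_rng(0)
tokens = [f"w{i}" for i in rng.zipf(1.5, size=50000)]
print(f"Zipf slope: {zipf_slope(tokens):.2f}, Heaps exponent: {heaps_slope(tokens):.2f}")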
Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
TLDR
It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.
Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information
TLDR
It is shown that ‘diagnostic classifiers’, trained to predict number from the internal states of a language model, provide a detailed understanding of how, when, and where this information is represented, and this knowledge can be used to improve their performance.
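A hedged sketch of the diagnostic-classifier (probing) idea: train a simple classifier to predict a linguistic feature, such as grammatical number, from hidden states. The random vectors below are placeholders for real activations, so this probe should score at chance level:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(200, 64))   # stand-ins for LSTM hidden states
labels = rng.integers(0, 2, size=200)        # 0 = singular subject, 1 = plural

probe = LogisticRegression(max_iter=1000).fit(hidden_states[:150], labels[:150])
print("probe accuracy:", probe.score(hidden_states[150:], labels[150:]))
# Real hidden states that encode number should let the probe score well above
# chance; where and when accuracy drops helps localize how agreement is tracked.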
Regularizing and Optimizing LSTM Language Models
TLDR
This paper proposes the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization, and introduces NT-ASGD, a variant of the averaged stochastic gradient method wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.
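A rough sketch of the DropConnect-on-recurrent-weights idea, assuming PyTorch; the weight shape and drop probability are illustrative, and this is not the AWD-LSTM code itself:

import torch

def drop_connect(weight_hh: torch.Tensor, p: float = 0.5):
    # Zero a random subset of the hidden-to-hidden *weights* (not activations),
    # sampling one mask per forward pass; rescale to preserve expected magnitude.
    mask = (torch.rand_like(weight_hh) > p).float()
    return weight_hh * mask / (1.0 - p)

w_hh = torch.randn(4 * 128, 128)   # e.g. an LSTM's stacked hidden-to-hidden weights
print(drop_connect(w_hh).shape)    # same shape, with roughly half the entries zeroed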