Corpus ID: 221516475

Measuring Massive Multitask Language Understanding

@article{Hendrycks2021MeasuringMM,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Xiaodong Song and Jacob Steinhardt},
  journal={ArXiv},
  year={2021},
  volume={abs/2009.03300}
}
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
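The evaluation protocol behind such a benchmark is straightforward to sketch: each task is a set of four-option multiple-choice questions, a handful of worked examples can be prepended to the test question (the paper reports zero-shot and few-shot settings), and accuracy is the fraction of questions where the model's preferred option matches the answer key. The Python sketch below is a minimal illustration under those assumptions, not the authors' released harness; `Question`, `score_completion`, and the prompt header wording are hypothetical stand-ins, with `score_completion` playing the role of a real model's log-likelihood for a continuation given a prompt.

```python
# Minimal sketch of few-shot multiple-choice evaluation on an MMLU-style task.
# `score_completion(prompt, continuation)` is a hypothetical hook for a model's
# log-likelihood of the continuation; swap in a real model to use this.
from dataclasses import dataclass
from typing import Callable, List

CHOICE_LETTERS = ["A", "B", "C", "D"]

@dataclass
class Question:
    subject: str
    stem: str
    choices: List[str]   # exactly four options
    answer: int          # index of the correct option

def format_question(q: Question, include_answer: bool) -> str:
    lines = [q.stem]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICE_LETTERS, q.choices)]
    lines.append("Answer:" + (f" {CHOICE_LETTERS[q.answer]}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(dev_examples: List[Question], test_q: Question) -> str:
    # Few-shot prompt: a subject header, worked examples, then the test question.
    header = f"The following are multiple choice questions (with answers) about {test_q.subject}.\n\n"
    shots = "\n\n".join(format_question(q, include_answer=True) for q in dev_examples)
    return header + shots + "\n\n" + format_question(test_q, include_answer=False)

def evaluate(dev: List[Question], test: List[Question],
             score_completion: Callable[[str, str], float]) -> float:
    correct = 0
    for q in test:
        prompt = build_prompt(dev, q)
        # Pick the answer letter the model scores highest as the next token(s).
        scores = [score_completion(prompt, f" {letter}") for letter in CHOICE_LETTERS]
        if max(range(4), key=scores.__getitem__) == q.answer:
            correct += 1
    return correct / len(test)
```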

Citations

Current Limitations of Language Models: What You Need is Retrieval
TLDR: It is argued that retrieval, one of several approaches to improving the performance-compute trade-off of language models, would resolve many of the limitations of the other approaches, since it can reduce the amount of supervision and efficiently extend the context over the entire training dataset and the entire past of the current sample.
Measuring Mathematical Problem Solving With the MATH Dataset
TLDR: This work introduces MATH, a new dataset of 12,500 challenging competition mathematics problems which can be used to teach models to generate answer derivations and explanations, and shows that accuracy remains relatively low, even with enormous Transformer models.
Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
TLDR: It is suggested that the function of few-shot examples in these cases is better described as locating an already learned task rather than meta-learning, which motivates rethinking the role of prompts in controlling and evaluating powerful language models.
What Makes Good In-Context Examples for GPT-3?
TLDR: This work investigates whether there are more effective strategies than random sampling for judiciously selecting in-context examples that better leverage GPT-3's few-shot capabilities, and proposes retrieving examples that are semantically similar to a test sample to formulate its corresponding prompt (see the sketch after this list).
How Can We Know What Language Models Know?
TLDR: This paper proposes mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts, as well as ensemble methods to combine answers from different prompts, to provide a tighter lower bound on what LMs know.
Scaling Laws for Transfer
TLDR: This work finds that pre-training effectively multiplies the fine-tuning dataset size, and believes the exponents in these power laws correspond to measures of the generality of a model and the proximity of distributions (in a directed rather than symmetric sense).
Language Models are Open Knowledge Graphs
TLDR: This paper shows how to construct knowledge graphs (KGs) from pre-trained language models (e.g., BERT, GPT-2/3) without human supervision, and proposes an unsupervised method to cast the knowledge contained within language models into KGs.
Neural Transfer Learning with Transformers for Social Science Text Analysis
TLDR: Across all evaluated tasks, textual styles, and training data set sizes, the conventional models are consistently outperformed by transfer learning with Transformer-based models, thereby demonstrating the potential benefits these models can bring to text-based social science research.
CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
TLDR: It is found that Transformer models have nascent performance, but that this performance is strongly influenced by model design and training dataset size, so there is still substantial room for improvement.
When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
TLDR: It is shown that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the level of performance increase in three legal tasks was directly tied to the domain specificity of the task.
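Among the follow-up work above, "What Makes Good In-Context Examples for GPT-3?" describes a concrete recipe: retrieve the training examples most similar to the test input and use them as the few-shot demonstrations in the prompt. The sketch below renders that recipe in minimal, self-contained form; the bag-of-words cosine similarity is only a placeholder for the learned sentence encoder used for retrieval in that paper, and the function names are illustrative.

```python
# Sketch of retrieval-based in-context example selection: rank (input, output)
# training pairs by similarity to the test input and put the top-k into the
# prompt. The bag-of-words embedding here is a stand-in for a sentence encoder.
import math
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    # Stand-in embedding: lowercase bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_demonstrations(train: List[Tuple[str, str]], test_input: str, k: int = 4) -> List[Tuple[str, str]]:
    # Keep the k training pairs whose inputs are most similar to the test input.
    test_vec = embed(test_input)
    ranked = sorted(train, key=lambda pair: cosine(embed(pair[0]), test_vec), reverse=True)
    return ranked[:k]

def build_retrieval_prompt(train: List[Tuple[str, str]], test_input: str, k: int = 4) -> str:
    shots = select_demonstrations(train, test_input, k)
    demo_text = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in shots)
    return f"{demo_text}\n\nInput: {test_input}\nOutput:"
```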

References

Showing 1-10 of 32 references
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR: This work presents a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.
Language Models are Unsupervised Multitask Learners
TLDR: It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
RACE: Large-scale ReAding Comprehension Dataset From Examinations
TLDR: The proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of state-of-the-art models and ceiling human performance.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
TLDR: A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR: This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
Language Models are Few-Shot Learners
TLDR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
TLDR: MCTest is presented, a freely available set of stories and associated questions intended for research on the machine comprehension of text; it requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
TLDR: A new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI.
PIQA: Reasoning about Physical Commonsense in Natural Language
TLDR: The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and an analysis of the dimensions of knowledge that existing models lack is provided, offering significant opportunities for future research.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks, and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.