Corpus ID: 221516475

Measuring Massive Multitask Language Understanding

@article{Hendrycks2021MeasuringMM,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={ArXiv},
  year={2021},
  volume={abs/2009.03300}
}
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
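The headline comparison in the abstract is macro-averaged multiple-choice accuracy over the 57 tasks against the 25% random-chance baseline for four-option questions. Below is a minimal sketch of that scoring, assuming per-task lists of predicted and gold answer letters; the task names and example data are illustrative placeholders, not the paper's released evaluation code.

# Minimal sketch: macro-averaged multiple-choice accuracy across tasks,
# compared against the 25% random-chance baseline for 4-option questions.
# Task names and example data below are illustrative, not from the paper.

RANDOM_CHANCE = 0.25  # four answer choices (A-D)

def task_accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the gold letter."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def macro_average(per_task_results):
    """Unweighted mean over tasks, so small subjects count as much as large ones."""
    accs = [task_accuracy(preds, golds) for preds, golds in per_task_results.values()]
    return sum(accs) / len(accs)

if __name__ == "__main__":
    # Toy results for three of the 57 subjects (hypothetical data).
    results = {
        "elementary_mathematics": (["A", "C", "B", "D"], ["A", "C", "D", "D"]),
        "us_history":             (["B", "B", "A", "C"], ["B", "D", "A", "C"]),
        "professional_law":       (["D", "A", "C", "B"], ["C", "A", "C", "D"]),
    }
    avg = macro_average(results)
    print(f"macro-averaged accuracy: {avg:.1%}")
    print(f"improvement over random chance: {(avg - RANDOM_CHANCE) * 100:.1f} points")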
