Measuring Massive Multitask Language Understanding
TLDR
While most recent models achieve near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvement before they can reach expert-level accuracy.
Aligning AI With Shared Human Values
TLDR
With the ETHICS dataset, it is found that current language models have a promising but incomplete understanding of basic ethical knowledge, providing a stepping stone toward AI that is aligned with human values.
Interpreting Black Box Models via Hypothesis Testing
TLDR
This work reframes black-box model interpretability as a multiple hypothesis testing problem and proposes two testing methods: one that provably controls the false discovery rate but is not yet feasible for large-scale applications, and an approximate testing method that can be applied to real-world datasets.
CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
TLDR
Transformer models are found to have nascent performance that is strongly influenced by model design and training dataset size, leaving substantial room for improvement.
Measuring Coding Challenge Competence With APPS
TLDR
This work introduces APPS, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate Python code fulfilling it, and finds that machine learning models are beginning to learn how to code.
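As a rough illustration of how such a specification-to-code benchmark can be scored (a minimal sketch, not the APPS harness itself; the function name, the stdin/stdout convention, and the test-case format are assumptions for this example), one can execute a generated program against held-out input/output test cases and report the fraction it passes:

    # Illustrative sketch only, not the APPS evaluation harness: score a
    # generated Python program by running it against input/output test cases.
    # The program is assumed to read from stdin and write to stdout;
    # all names here are hypothetical.
    import subprocess

    def test_case_pass_rate(program_path, test_cases, timeout=4):
        """test_cases: list of (stdin_text, expected_stdout) pairs."""
        passed = 0
        for stdin_text, expected in test_cases:
            try:
                result = subprocess.run(
                    ["python", program_path],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout,  # guard against non-terminating generations
                )
                if result.stdout.strip() == expected.strip():
                    passed += 1
            except subprocess.TimeoutExpired:
                pass                  # timed-out runs count as failures
        return passed / len(test_cases)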
Interpreting Black Box Models with Statistical Guarantees
TLDR
This work derives a multiple hypothesis testing framework for finding important features by testing whether the model's prediction differs significantly from what would be expected if the features were replaced with randomly sampled counterfactuals.
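As a rough sketch of the general idea (not the paper's exact procedure; the model, data, permutation-based counterfactual sampler, and Wilcoxon test below are all assumptions made for this example), per-feature p-values from a counterfactual-replacement test can be passed to a Benjamini-Hochberg step to control the false discovery rate:

    # Minimal sketch of the general idea, NOT the paper's method: test whether
    # replacing a feature with randomly sampled counterfactuals significantly
    # increases prediction error, then correct the p-values for multiple testing.
    import numpy as np
    from scipy.stats import wilcoxon
    from statsmodels.stats.multitest import multipletests

    def feature_pvalue(model, X, y, j, n_draws=50, seed=0):
        """One-sided test: does resampling feature j increase squared error?"""
        rng = np.random.default_rng(seed)
        base_err = (model.predict(X) - y) ** 2        # per-instance baseline error
        cf_err = np.zeros_like(base_err)
        for _ in range(n_draws):
            X_cf = X.copy()
            X_cf[:, j] = rng.permutation(X[:, j])     # counterfactual draw for feature j
            cf_err += (model.predict(X_cf) - y) ** 2
        cf_err /= n_draws                             # average error under counterfactuals
        # Wilcoxon signed-rank test of H1: counterfactual error > baseline error
        return wilcoxon(cf_err, base_err, alternative="greater").pvalue

    def important_features(model, X, y, alpha=0.1):
        pvals = [feature_pvalue(model, X, y, j) for j in range(X.shape[1])]
        # Benjamini-Hochberg step controls the false discovery rate at level alpha
        reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
        return np.flatnonzero(reject)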
Measuring Mathematical Problem Solving With the MATH Dataset
TLDR
This work introduces MATH, a new dataset of 12,500 challenging competition mathematics problems that can be used to teach models to generate answer derivations and explanations, and shows that accuracy remains relatively low, even with enormous Transformer models.
Streaming Complexity of SVMs
TLDR
It is shown that, for both problems in dimensions $d=1,2$, one can obtain streaming algorithms using polynomially less space than SGD for strongly convex functions such as the bias-regularized SVM, and polynomial lower bounds are proved for both point estimation and optimization.
Limitations of Post-Hoc Feature Alignment for Robustness
TLDR
It is shown that this approach significantly helps only with a narrow set of distribution shifts, and several settings are identified in which it even degrades performance, calling into question the utility of this approach, and of Unsupervised Domain Adaptation more broadly, for improving robustness in practice.