Measuring Massive Multitask Language Understanding
- Dan Hendrycks, Collin Burns, J. Steinhardt
- Computer Science · International Conference on Learning Representations (ICLR)
- 7 September 2020
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
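For illustration only, a minimal sketch of scoring a multiple-choice benchmark against the 25% random-chance baseline for four-option questions; this is not the paper's evaluation harness, and `model_choice` is a hypothetical stand-in for a real model call.

```python
# Toy sketch: multiple-choice accuracy vs. the random-chance baseline.
import random

def model_choice(question, options):
    """Hypothetical stand-in for a language model's answer selection."""
    return random.randrange(len(options))  # replace with a real model call

def accuracy(tasks):
    """tasks: list of (question, options, correct_index) triples."""
    correct = sum(model_choice(q, opts) == answer for q, opts, answer in tasks)
    return correct / len(tasks)

sample = [("2 + 2 = ?", ["3", "4", "5", "6"], 1)]
print(f"accuracy: {accuracy(sample):.2f} (random chance on 4 options: 0.25)")
```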
Aligning AI With Shared Human Values
- Dan Hendrycks, Collin Burns, J. Steinhardt
- Computer Science · International Conference on Learning Representations (ICLR)
- 5 August 2020
With the new ETHICS dataset, current language models are found to have a promising but incomplete understanding of basic ethical knowledge; the dataset provides a stepping stone toward AI that is aligned with human values.
Measuring Coding Challenge Competence With APPS
- Dan Hendrycks, Steven Basart, J. Steinhardt
- Computer Science · NeurIPS Datasets and Benchmarks
- 20 May 2021
APPS, a benchmark for code generation, is introduced; it measures the ability of models to take an arbitrary natural-language specification and generate satisfactory Python code, and shows that machine learning models are now beginning to learn how to code.
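A hedged sketch of the kind of functional-correctness check such a benchmark implies: run a candidate Python program against input/output test cases and require the printed output to match. This is a simplified illustration, not the APPS evaluation code.

```python
# Toy sketch: judge a generated Python program by whether its stdout matches
# the expected output for each test case.
import subprocess
import sys

def passes_tests(program_source, test_cases, timeout=4):
    """test_cases: list of (stdin_text, expected_stdout) pairs."""
    for stdin_text, expected in test_cases:
        result = subprocess.run(
            [sys.executable, "-c", program_source],
            input=stdin_text, capture_output=True, text=True, timeout=timeout,
        )
        if result.stdout.strip() != expected.strip():
            return False
    return True

candidate = "print(sum(int(x) for x in input().split()))"
print(passes_tests(candidate, [("1 2 3", "6"), ("10 20", "30")]))  # True
```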
Measuring Mathematical Problem Solving With the MATH Dataset
- Dan Hendrycks, Collin Burns, J. Steinhardt
- Computer Science · NeurIPS Datasets and Benchmarks
- 5 March 2021
This work introduces MATH, a new dataset of 12,500 challenging competition mathematics problems, which can be used to teach models to generate answer derivations and explanations, and shows that accuracy remains relatively low, even with enormous Transformer models.
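As a toy illustration of exact-match grading on a dataset whose problems come with worked derivations (a simplification, not the paper's grading code), one might extract the final \boxed{...} answer from a generated solution and compare it with the reference answer:

```python
# Toy sketch: extract the last \boxed{...} answer and grade by exact match.
import re

def extract_boxed(text):
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(generated_solution, reference_answer):
    return extract_boxed(generated_solution) == reference_answer.strip()

solution = r"Adding the terms gives $2 + 3 = 5$, so the answer is $\boxed{5}$."
print(is_correct(solution, "5"))  # True
```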
CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
- Dan Hendrycks, Collin Burns, Anya Chen, Spencer Ball
- Computer Science · NeurIPS Datasets and Benchmarks
- 10 March 2021
It is found that Transformer models have nascent performance on this task, and that this performance is strongly influenced by model design and training dataset size, suggesting there is still substantial room for improvement.
Interpreting Black Box Models via Hypothesis Testing
- Collin Burns, Jesse Thomason, Wesley Tansey
- Computer Science · Foundations of Data Science Conference
- 29 March 2019
This work reframes black-box model interpretability as a multiple hypothesis testing problem and proposes two testing methods: one that provably controls the false discovery rate but is not yet feasible for large-scale applications, and an approximate testing method that can be applied to real-world datasets.
Limitations of Post-Hoc Feature Alignment for Robustness
- Collin Burns, J. Steinhardt
- Computer Science · Computer Vision and Pattern Recognition
- 10 March 2021
It is shown that this approach significantly helps with only a narrow set of distribution shifts, and several settings are identified in which it even degrades performance, calling into question the utility of this approach, and of Unsupervised Domain Adaptation more broadly, for improving robustness in practice.
Interpreting Black Box Models with Statistical Guarantees
- Collin Burns, Jesse Thomason, Wesley Tansey
- Computer Science · arXiv
- 29 March 2019
A multiple hypothesis testing framework is derived for finding important features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with randomly sampled counterfactuals.
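A rough sketch in the spirit of the counterfactual test described above, not the papers' exact procedure: flag a feature as important when the model's prediction on the original input is extreme relative to predictions obtained by resampling that feature from reference data, summarized as an empirical p-value.

```python
# Toy sketch: empirical p-value for one feature via counterfactual resampling.
import numpy as np

def importance_pvalue(model, x, feature_idx, reference_values,
                      n_samples=200, rng=None):
    rng = rng or np.random.default_rng(0)
    original = model(x)
    null_preds = []
    for _ in range(n_samples):
        x_cf = x.copy()
        x_cf[feature_idx] = rng.choice(reference_values)  # counterfactual value
        null_preds.append(model(x_cf))
    null_preds = np.array(null_preds)
    # How extreme is the original prediction relative to the resampled ones?
    deviation = np.abs(null_preds - null_preds.mean())
    extreme = np.sum(deviation >= abs(original - null_preds.mean()))
    return (1 + extreme) / (1 + n_samples)

# Toy usage: the model only reads feature 0, so feature 0 gets a small p-value.
model = lambda v: 3.0 * v[0]
x = np.array([2.0, 5.0])
print(importance_pvalue(model, x, 0, reference_values=np.linspace(-1, 1, 50)))
print(importance_pvalue(model, x, 1, reference_values=np.linspace(-1, 1, 50)))
```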
Streaming Complexity of SVMs
- Alexandr Andoni, Collin Burns, Yi Li, S. Mahabadi, David P. Woodruff
- Computer Science · International Workshop on Approximation Algorithms for Combinatorial Optimization Problems and International Workshop on Randomization and Computation (APPROX/RANDOM)
- 7 July 2020
It is shown that, for both problems in dimensions $d=1,2$, one can obtain streaming algorithms with space polynomially smaller than that of SGD for strongly convex functions such as the bias-regularized SVM, and polynomial lower bounds are proved for both point estimation and optimization.
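For context, a minimal sketch of the SGD baseline those space bounds are compared against: stochastic gradient descent on a bias-regularized SVM objective, which is strongly convex because the bias term is regularized along with the weights. This is a generic textbook implementation, not the streaming algorithms from the paper.

```python
# Toy sketch of SGD on the bias-regularized SVM objective
#   f(w, b) = (lam / 2) * (||w||^2 + b^2) + mean_i max(0, 1 - y_i (w.x_i + b)).
import numpy as np

def sgd_svm(points, labels, lam=0.1, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = points.shape
    w, b = np.zeros(d), 0.0
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            step += 1
            eta = 1.0 / (lam * step)  # standard step size for strongly convex SGD
            margin = labels[i] * (points[i] @ w + b)
            grad_w = lam * w - (labels[i] * points[i] if margin < 1 else 0.0)
            grad_b = lam * b - (labels[i] if margin < 1 else 0.0)
            w, b = w - eta * grad_w, b - eta * grad_b
    return w, b

# Toy 1-D usage: positives to the right of the origin, negatives to the left.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
print(sgd_svm(X, y))
```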
Discovering Latent Knowledge in Language Models Without Supervision
- Collin Burns, Hao-Tong Ye, D. Klein, J. Steinhardt
- Computer Science · arXiv
- 7 December 2022
It is shown that despite using no supervision and no model outputs, the method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average.
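A small sketch of the kind of unsupervised consistency objective the method (Contrast-Consistent Search) optimizes: a probe's probabilities for a statement and its negation should be consistent (sum to one) and confident (not both near 0.5). In the full method this loss is minimized over a probe applied to the model's hidden states; the toy snippet below only evaluates the loss on given probabilities.

```python
# Toy sketch: consistency + confidence loss on contrast-pair probabilities.
import numpy as np

def ccs_loss(p_pos, p_neg):
    """p_pos, p_neg: probe probabilities for a statement and its negation."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return np.mean(consistency + confidence)

# A consistent, confident probe scores lower than a degenerate 0.5 probe.
print(ccs_loss(np.array([0.9, 0.1]), np.array([0.1, 0.9])))  # ~0.01
print(ccs_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.25
```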