Publications
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
TLDR
It is found that using larger models and artificial data augmentations can improve robustness on real-world distribution shifts, contrary to claims in prior work.
Natural Adversarial Examples
TLDR
This work introduces two challenging datasets that reliably cause machine learning model performance to substantially degrade, and curates an adversarial out-of-distribution detection dataset called IMAGENET-O, which is the first out-of-distribution detection dataset created for ImageNet models.
A Benchmark for Anomaly Segmentation
TLDR
The Combined Anomalous Object Segmentation benchmark is introduced, combining two novel datasets for anomaly segmentation that incorporate both realism and anomaly diversity; the work also improves out-of-distribution detectors on large-scale multi-class datasets and introduces detectors for the previously unexplored setting of multi-label out-of-distribution detection.
Aligning AI With Shared Human Values
TLDR
With the ETHICS dataset, it is found that current language models have a promising but incomplete understanding of basic ethical knowledge; the dataset provides a stepping stone toward AI that is aligned with human values.
DIODE: A Dense Indoor and Outdoor DEpth Dataset
TLDR
DIODE (Dense Indoor/Outdoor DEpth) is the first public dataset to include RGBD images of indoor and outdoor scenes obtained with one sensor suite, in contrast to existing datasets that focus on just one domain/scene type and employ different sensors, making generalization across domains difficult.
Measuring Massive Multitask Language Understanding
TLDR
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
Scaling Out-of-Distribution Detection for Real-World Settings
TLDR
This work departs from small-scale settings and explores large-scale multi-class and multi-label settings with high-resolution images and hundreds of classes for out-of-distribution detection, finding that a surprisingly simple detector based on the maximum logit outperforms prior methods in all the large-scale multi-class, multi-label, and segmentation tasks.
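The maximum-logit detector named in this summary is simple enough to sketch directly. Below is a minimal illustration in PyTorch; the `model` and `threshold` in the usage comments are hypothetical stand-ins, and the sign convention (higher score means more anomalous) is one common choice rather than necessarily the paper's exact formulation.

```python
import torch

def max_logit_score(logits: torch.Tensor) -> torch.Tensor:
    """Anomaly score from the maximum unnormalized logit.

    Confident in-distribution inputs tend to produce a large
    maximum logit, so negating it yields a score that is high
    for likely out-of-distribution inputs.
    """
    return -logits.max(dim=-1).values

# Hypothetical usage with any multi-class classifier:
# logits = model(images)            # shape: (batch, num_classes)
# scores = max_logit_score(logits)  # shape: (batch,)
# is_ood = scores > threshold       # threshold tuned on validation data
```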
Measuring Mathematical Problem Solving With the MATH Dataset
TLDR
This work introduces MATH, a new dataset of 12,500 challenging competition mathematics problems which can be used to teach models to generate answer derivations and explanations, and shows that accuracy remains relatively low, even with enormous Transformer models.
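Benchmarks of this kind are usually scored by extracting a final answer from the model's generated derivation and comparing it with the reference. The sketch below pulls the contents of a LaTeX \boxed{...} span, a convention MATH solutions use for final answers; treating it as a flat regex with no nested braces is a simplifying assumption, and this is not the dataset's official grading script.

```python
import re

def extract_boxed_answer(solution: str) -> str | None:
    """Pull the final answer out of a LaTeX \\boxed{...} span.

    Simplifying assumption: the boxed answer contains no
    nested braces, so a flat regex suffices.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", solution)
    return match.group(1) if match else None

# Hypothetical usage:
# pred = extract_boxed_answer(model_output)
# correct = pred is not None and pred == reference_answer
```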
Measuring Coding Challenge Competence With APPS
TLDR
APPS is introduced, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code, and it shows that machine learning models are now beginning to learn how to code.
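One concrete way to decide whether generated code is "satisfactory" is to run it against held-out input/output test cases, the style of evaluation that code-generation benchmarks like this rely on. The harness below, which invokes a candidate solution as a subprocess on stdin/stdout pairs, is a simplified sketch for illustration rather than APPS's official evaluation script; a real harness would also sandbox execution.

```python
import subprocess

def passes_tests(solution_path: str, cases: list[tuple[str, str]]) -> bool:
    """Run a candidate Python solution on (stdin, expected stdout) pairs.

    Simplified sketch: real harnesses add sandboxing, memory
    limits, and more careful output normalization.
    """
    for stdin_text, expected in cases:
        result = subprocess.run(
            ["python", solution_path],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=5,
        )
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# Hypothetical usage:
# ok = passes_tests("candidate.py", [("1 2\n", "3\n")])
```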
A Quantitative Measure of Generative Adversarial Network Distributions
TLDR
A new measure for evaluating the quality of distributions learned by Generative Adversarial Networks (GANs) is introduced; it computes the Kullback-Leibler divergence from a GAN-generated image set to a real image set.
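For reference, the Kullback-Leibler divergence between two discrete distributions p and q is KL(p || q) = sum_i p_i log(p_i / q_i), and it can be estimated from empirical histograms. The sketch below assumes the two image sets have already been reduced to discrete distributions (for example, histograms of classifier-predicted labels); that summarization step is an assumption for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Estimate KL(p || q) for discrete distributions.

    Inputs may be unnormalized counts; eps guards against
    log-of-zero and division-by-zero on empty bins.
    """
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical usage: label histograms over k classes for
# GAN-generated images (p) versus real images (q).
# p = np.bincount(labels_gan, minlength=k).astype(float)
# q = np.bincount(labels_real, minlength=k).astype(float)
# print(kl_divergence(p, q))
```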