Deep Anomaly Detection with Outlier Exposure
TLDR
In extensive experiments on natural language processing and small- and large-scale vision tasks, it is found that Outlier Exposure significantly improves detection performance and that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; OE is used to mitigate this issue.
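As a rough illustration of the idea (not the paper's exact training code), Outlier Exposure for a classifier can be sketched as the usual cross-entropy loss on in-distribution data plus a term that pushes predictions on an auxiliary outlier batch toward the uniform distribution; the helper name `oe_loss` and the weight `lam` below are illustrative choices.

```python
import torch.nn.functional as F

def oe_loss(logits_in, targets_in, logits_out, lam=0.5):
    """Sketch of an Outlier Exposure-style objective for classification.

    logits_in / targets_in: in-distribution batch and its labels.
    logits_out: logits on an auxiliary outlier batch (labels unused).
    lam: weight on the outlier term (0.5 is a common choice, not prescriptive).
    """
    # Standard supervised loss on in-distribution data.
    ce_in = F.cross_entropy(logits_in, targets_in)
    # Cross-entropy to the uniform distribution on outliers (up to a constant):
    # encourages low-confidence predictions off-distribution.
    ce_out = -F.log_softmax(logits_out, dim=1).mean(dim=1).mean()
    return ce_in + lam * ce_out
```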
Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty
TLDR
This work finds that self-supervision can benefit robustness in a variety of ways, including robustness to adversarial examples, label corruption, and common input corruptions, and greatly benefits out-of-distribution detection on difficult, near-distribution outliers.
Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise
TLDR
It is demonstrated that robustness to label noise up to severe strengths can be achieved by using a set of trusted data with clean labels, and a loss correction that utilizes trusted examples in a data-efficient manner to mitigate the effects of label noise on deep neural network classifiers is proposed.
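A hedged sketch of the general recipe described here, in the style of a loss correction estimated from trusted data (function names like `estimate_corruption_matrix` and `corrected_loss` are illustrative, not the paper's code): average a noisy-trained model's softmax outputs over trusted examples of each true class to estimate a label-corruption matrix, then train against the noisy labels through that matrix.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_corruption_matrix(model, trusted_loader, num_classes):
    """Estimate C[i, j] ~= p(noisy label j | true label i) by averaging a
    noisy-trained model's softmax outputs over trusted examples of class i."""
    C = torch.zeros(num_classes, num_classes)
    counts = torch.zeros(num_classes)
    model.eval()
    for x, y_true in trusted_loader:            # trusted examples have clean labels
        probs = F.softmax(model(x), dim=1)
        for c in range(num_classes):
            mask = y_true == c
            if mask.any():
                C[c] += probs[mask].sum(dim=0)
                counts[c] += mask.sum()
    return C / counts.clamp(min=1).unsqueeze(1)

def corrected_loss(logits, noisy_targets, C):
    """Cross-entropy against noisy labels after passing the model's clean-label
    predictions through the estimated corruption matrix."""
    clean_probs = F.softmax(logits, dim=1)
    noisy_probs = clean_probs @ C               # p(noisy label | x)
    return F.nll_loss(torch.log(noisy_probs + 1e-12), noisy_targets)
```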
Using Pre-Training Can Improve Model Robustness and Uncertainty
TLDR
It is shown that although pre-training may not improve performance on traditional classification metrics, it improves model robustness and uncertainty estimates and surpasses the state-of-the-art in adversarial robustness.
A Benchmark for Anomaly Segmentation
TLDR
The Combined Anomalous Object Segmentation benchmark is introduced, combining two novel datasets for anomaly segmentation that incorporate both realism and anomaly diversity; the work also improves out-of-distribution detectors on large-scale multi-class datasets and introduces detectors for the previously unexplored setting of multi-label out-of-distribution detection.
Measuring Massive Multitask Language Understanding
TLDR
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
Scaling Out-of-Distribution Detection for Real-World Settings
TLDR
This work departs from small-scale settings and explores large-scale multi-class and multi-label settings with high-resolution images and hundreds of classes for out-of-distribution detection, finding that a surprisingly simple detector based on the maximum logit outperforms prior methods in all the large-scale multi-class, multi-label, and segmentation tasks.
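The maximum-logit detector mentioned above is simple enough to sketch; this is a minimal example under an assumed interface (a `model` that returns class logits), scoring each input by its negative maximum logit so that larger scores indicate more anomalous inputs.

```python
import torch

@torch.no_grad()
def max_logit_score(model, x):
    """Anomaly score from the maximum unnormalized logit: out-of-distribution
    inputs tend to receive lower max logits, hence higher scores here."""
    logits = model(x)                 # shape: (batch, num_classes)
    return -logits.max(dim=1).values  # larger score => more anomalous
```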
Measuring Coding Challenge Competence With APPS
TLDR
APPS is introduced, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code and shows that machine learning models are now beginning to learn how to code.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
TLDR
Evaluation of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters, finds that model performance and calibration both improve with scale but are poor in absolute terms.
What Would Jiminy Cricket Do? Towards Agents That Behave Morally
TLDR
This work introduces Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of semantically rich, morally salient scenarios that robustly evaluate whether agents can act morally while maximizing reward.
...
...