Adversarial Training for High-Stakes Reliability
Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel Haas, Buck Shlegeris, Nate Thomas
In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a language generation task as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques… 
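The adversarial training setup described above — an adversary proposes hard examples, and the model trains on them to improve worst-case behavior — can be sketched in miniature. This is a toy illustration only, not the paper's method: the logistic-regression "model", the FGSM-style one-step adversary, and all the data are assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_x(w, b, x, y):
    # d(logistic loss)/dx: (p - y) * w, where p = sigmoid(w*x + b)
    return (sigmoid(w * x + b) - y) * w

def adversary(w, b, x, y, eps=0.3):
    # One FGSM-style step: nudge x in the direction that increases the loss.
    g = grad_x(w, b, x, y)
    return x + eps * (1 if g > 0 else -1)

def adversarial_train(data, steps=2000, lr=0.1):
    # Train on adversarially perturbed inputs instead of the clean ones.
    w, b = 0.0, 0.0
    random.seed(0)
    for _ in range(steps):
        x, y = random.choice(data)
        x_adv = adversary(w, b, x, y)   # adversary generates a hard example
        p = sigmoid(w * x_adv + b)
        w -= lr * (p - y) * x_adv       # SGD step on the adversarial example
        b -= lr * (p - y)
    return w, b

# Toy 1-D data: negative inputs are class 0, positive inputs are class 1.
w, b = adversarial_train([(-2.0, 0), (-1.5, 0), (1.5, 1), (2.0, 1)])
```

The key design point mirrored here is the inner adversary call inside the training loop: each update is taken against a worst-case perturbation rather than the original example.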
Robust Feature-Level Adversaries are Interpretability Tools
The results indicate that feature-level attacks are a promising approach for rigorous interpretability research and support the design of tools to better understand what a model has learned and diagnose brittle feature associations.


Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
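The few-shot setup above works by conditioning the model on a handful of worked examples in the prompt rather than updating its weights. A minimal sketch of constructing such a prompt (the `few_shot_prompt` helper and the Q/A format are illustrative assumptions, not the paper's exact prompt template):

```python
def few_shot_prompt(examples, query):
    """Build an in-context learning prompt from (question, answer) pairs."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {query}\nA:")  # model completes the final answer
    return "\n\n".join(blocks)

examples = [
    ("What is 123 + 456?", "579"),
    ("What is 700 - 250?", "450"),
]
prompt = few_shot_prompt(examples, "What is 316 + 284?")
```

The resulting string would be sent to the language model as-is; the demonstrations steer the completion without any gradient updates.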
PaLM: Scaling Language Modeling with Pathways
A 540-billion-parameter, densely activated Transformer language model called PaLM achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and surpassing average human performance on the recently released BIG-bench benchmark.
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
These experiments validate that SayCan can execute temporally extended, complex, and abstract instructions, and grounding the LLM in the real world via affordances nearly doubles performance over the non-grounded baselines.
Training Compute-Optimal Large Language Models
This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches state-of-the-art average accuracy on the MMLU benchmark.
MuZero with Self-competition for Rate Control in VP9 Video Compression
This paper targets the problem of learning a rate control policy to select the quantization parameters (QP) in the encoding process of libvpx, an open source VP9 video compression library widely used by popular video-on-demand (VOD) services.
Training language models to follow instructions with human feedback
The results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent, improving truthfulness and reducing toxic output generation while incurring minimal performance regressions on public NLP datasets.
LaMDA: Language Models for Dialog Applications
It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
This paper presents an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.
Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants
Generative Annotation Assistants (GAAs) are introduced, generator-in-the-loop models that provide real-time suggestions that annotators can either approve, modify, or reject entirely and are found to lead to higher downstream model performance on a variety of question answering tasks over adversarial data collection.
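The approve/modify/reject workflow described above can be sketched as a single decision step; the `annotate` helper and its argument names are hypothetical, invented for this illustration rather than taken from the paper.

```python
def annotate(suggestion, decision, edited=None):
    """Resolve one generator-in-the-loop annotation step.

    The assistant proposes `suggestion`; the annotator either approves it,
    modifies it (supplying `edited`), or rejects it entirely.
    """
    if decision == "approve":
        return suggestion
    if decision == "modify":
        return edited
    return None  # rejected: the annotator writes an example from scratch

# Example: the annotator tweaks the generated question before accepting it.
result = annotate("Who wrote Hamlet?", "modify", edited="Who authored Hamlet?")
```

Keeping the human decision explicit is what distinguishes this setup from fully automatic data generation: every accepted example has been vetted or edited by an annotator.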