Corpus ID: 238198240

Unsolved Problems in ML Safety

@article{Hendrycks2021UnsolvedPI,
  title={Unsolved Problems in ML Safety},
  author={Dan Hendrycks and Nicholas Carlini and John Schulman and Jacob Steinhardt},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.13916}
}
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely… 

Citations

Taxonomy of Machine Learning Safety: A Survey and Primer
TLDR
The taxonomy of ML safety presents a safety-oriented categorization of ML techniques, providing guidance for improving the dependability of ML design and development, and can serve as a safety checklist to aid designers in improving the coverage and diversity of safety strategies employed in any given ML system.
How to Certify Machine Learning Based Safety-critical Systems? A Systematic Literature Review
TLDR
A systematic literature review of research papers published between 2015 and 2020, covering topics related to the certification of ML systems, highlights the community's enthusiasm for the subject as well as a lack of diversity in terms of datasets and types of ML models.
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
TLDR
A new data augmentation strategy, PixMix, is designed that utilizes the natural structural complexity of pictures such as fractals; it outperforms numerous baselines, is near Pareto-optimal, and roundly improves safety measures, demonstrating that PixMix can improve uncertainty estimation under distribution shifts with unseen image corruptions.
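The mixing scheme alluded to above can be pictured with a short sketch. This is an illustrative approximation only, not the authors' released implementation: the function pixmix_style_augment and the parameters mixer_pool, max_rounds, and beta are placeholder names, and the exact augmentation and mixing operations in PixMix may differ.

import random
import numpy as np

def pixmix_style_augment(image, mixer_pool, max_rounds=4, beta=3.0):
    # image: float array with values in [0, 1]; mixer_pool: list of same-shape
    # fractal-like pictures. Repeatedly blend the running image with either the
    # original image or a random mixer, using additive or multiplicative mixing.
    mixed = image.copy()
    for _ in range(random.randint(1, max_rounds)):
        source = image if random.random() < 0.5 else random.choice(mixer_pool)
        weight = np.random.beta(beta, beta)  # random mixing coefficient in (0, 1)
        if random.random() < 0.5:
            mixed = (1.0 - weight) * mixed + weight * source          # additive mix
        else:
            mixed = (mixed ** (1.0 - weight)) * (source ** weight)    # multiplicative mix
        mixed = np.clip(mixed, 0.0, 1.0)
    return mixed

In a training pipeline such a function would be applied to each image before normalization; the structurally complex mixer pictures are what distinguish this from standard augmentation.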
Predictability and Surprise in Large Generative Models
TLDR
This paper highlights a counterintuitive property of large-scale generative models: they combine predictable loss on a broad training distribution with unpredictable specific capabilities, inputs, and outputs, and it analyzes how these conflicting properties give model developers various motivations for deploying these models as well as challenges that can hinder deployment.
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
TLDR
An anomaly detection task for aberrant policies is proposed, and several baseline detectors are offered, for phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
The Dilemma Between Data Transformations and Adversarial Robustness for Time Series Application Systems
TLDR
This work explores how data transformation techniques such as feature selection, dimensionality reduction, or trend extraction may impact an adversary's ability to create effective adversarial samples against a recurrent neural network, analyzing the question from the perspective of the data manifold and the presentation of its intrinsic features.
A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges
TLDR
This survey aims to provide a cross-domain and comprehensive review of numerous eminent works in the respective areas while identifying their commonalities, and it discusses and sheds light on future lines of research, intending to bring these fields closer together.
Self-Supervised Losses for One-Class Textual Anomaly Detection
TLDR
The self-supervision approach outperforms other methods under various anomaly detection scenarios, improving the AUROC score on semantic anomalies by 11.6% and on syntactic anomalies by 22.8% on average.
Certified Adversarial Defenses Meet Out-of-Distribution Corruptions: Benchmarking Robustness and Simple Baselines
TLDR
FourierMix augmentations help eliminate the spectral bias of certifiably robust models, enabling them to achieve significantly better robustness guarantees on a range of OOD benchmarks, and a comprehensive benchmarking suite containing corruptions from different regions of the spectral domain is proposed.
Red Teaming Language Models with Language Models
TLDR
This work automatically finds cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM, and evaluates the target LM’s replies to generated test questions using a classifier trained to detect offensive content.
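The loop described in this TLDR (one LM proposes test questions, the target LM answers, and a classifier flags offensive replies) can be sketched as follows. The methods red_lm.generate, target_lm.respond, and harm_classifier.score are hypothetical placeholder interfaces, not an actual library API, and the prompt text is illustrative.

def red_team(red_lm, target_lm, harm_classifier, num_cases=1000, threshold=0.5):
    # Generate test questions with one LM, collect the target LM's replies,
    # and keep the question/reply pairs the classifier scores as likely offensive.
    failures = []
    for _ in range(num_cases):
        question = red_lm.generate("List of questions to ask someone:\n1.")
        reply = target_lm.respond(question)
        if harm_classifier.score(question, reply) > threshold:
            failures.append((question, reply))
    return failures

The returned failure cases can then be inspected or used to patch the target model before deployment.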

References

SHOWING 1-10 OF 220 REFERENCES
Risks from Learned Optimization in Advanced Machine Learning Systems
We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer - a situation we refer to as mesa-optimization, a neologism we introduce in this paper.
Concrete Problems in AI Safety
TLDR
A list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function, an objective function that is too expensive to evaluate frequently, or undesirable behavior during the learning process, is presented.
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
TLDR
A new data augmentation strategy, PixMix, is designed that utilizes the natural structural complexity of pictures such as fractals; it outperforms numerous baselines, is near Pareto-optimal, and roundly improves safety measures, demonstrating that PixMix can improve uncertainty estimation under distribution shifts with unseen image corruptions.
Benchmarking Safe Exploration in Deep Reinforcement Learning
TLDR
This work proposes to standardize constrained RL as the main formalism for safe exploration, and presents the Safety Gym benchmark suite, a new slate of high-dimensional continuous control environments for measuring research progress on constrained RL.
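For reference, the constrained RL formalism referred to here is the standard constrained MDP objective. In the usual notation (reward r, cost c, discount factor \gamma, and cost budget d; the symbols are conventions, not quoted from this page), the goal is a policy \pi that solves

\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t} \gamma^{t}\, c(s_t, a_t)\Big] \le d,

so that exploration is judged not only by return but also by how much constraint-violating cost it incurs.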
Alignment for Advanced Machine Learning Systems
TLDR
This research proposal focuses on two major technical obstacles to AI alignment: the challenge of specifying the right kind of objective functions, and the challenge of designing AI systems that avoid unintended consequences and undesirable behavior even when the objective function does not line up perfectly with the designers' intentions.
A 20-Year Community Roadmap for Artificial Intelligence Research in the US
TLDR
These are the major recommendations of a recent community effort coordinated by the Computing Community Consortium and the Association for the Advancement of Artificial Intelligence to formulate a Roadmap for AI research and development over the next two decades.
Conservative Objective Models for Effective Offline Model-Based Optimization
TLDR
Conservative objective models (COMs) are proposed: a method that learns a model of the objective function which lower-bounds the actual value of the ground-truth objective on out-of-distribution inputs and uses that model for optimization.
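A hedged sketch of what lower-bounding the objective on out-of-distribution inputs can look like as a training loss (a paraphrase of the conservative-regularization idea, not the paper's exact formulation; \hat{f}_\theta is the learned objective model, \mathcal{D} the offline dataset, \mu a distribution over adversarially found inputs, and \alpha a regularization weight):

\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[(\hat{f}_\theta(x) - y)^2\big]
\;+\; \alpha \Big( \mathbb{E}_{x^{-} \sim \mu}\big[\hat{f}_\theta(x^{-})\big]
- \mathbb{E}_{x \sim \mathcal{D}}\big[\hat{f}_\theta(x)\big] \Big).

The second term pushes predicted values down on points found by ascending \hat{f}_\theta and up on dataset points, which discourages the optimizer from exploiting overestimated out-of-distribution inputs.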
Explaining Explanations: An Overview of Interpretability of Machine Learning
There has recently been a surge of work in explanatory artificial intelligence (XAI). This research area tackles the important problem that complex machines and algorithms often cannot provide insights into their behavior and thought processes.
The Values Encoded in Machine Learning Research
TLDR
This paper presents a rigorous examination of the field's values by quantitatively and qualitatively analyzing 100 highly cited ML papers published at premier ML conferences, ICML and NeurIPS, and finds increasingly close ties between these highly cited papers and tech companies and elite universities.
Quantifying Generalization in Reinforcement Learning
TLDR
It is shown that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.