Corpus ID: 209376307

Self-Play Learning Without a Reward Metric

Dan Schmidt, N. Moran, Jonathan S. Rosenfeld, Jonathan Rosenthal, J. Yedidia
The AlphaZero algorithm for learning strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to…
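To make the distinction between a quantitative reward and a total ordering concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the `Outcome` class, its fields, and the lexicographic rule (winner first, then margin) are assumptions chosen only to illustrate that outcomes can be compared without assigning numeric weights to each component.

```python
from dataclasses import dataclass


# Hypothetical outcome type (not from the paper). With order=True, instances
# compare lexicographically by field order: first `won`, then `margin`.
@dataclass(frozen=True, order=True)
class Outcome:
    won: bool    # True if the player of interest won the game
    margin: int  # score margin from that player's perspective (negative on a loss)


narrow_win = Outcome(won=True, margin=1)
big_win = Outcome(won=True, margin=20)
narrow_loss = Outcome(won=False, margin=-1)
big_loss = Outcome(won=False, margin=-20)

# Any win outranks any loss, and a larger margin breaks ties within each group.
# No weight between "winning" and "margin" had to be chosen.
ranked = sorted([big_win, narrow_loss, narrow_win, big_loss])
```

A scalar reward such as `win_bonus + weight * margin` would instead require picking `weight`, which is the balancing step the abstract says a pure ordering avoids.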


Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
This paper generalises the approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains, and convincingly defeated a world-champion program in each case.
Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization
Results from applying the R2 algorithm to instances of two-dimensional and three-dimensional bin packing problems show that it outperforms generic Monte Carlo tree search, heuristic algorithms and integer programming solvers.
Learning values across many orders of magnitude
This work proposes to adaptively normalize the targets used in learning, useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when the policy of behavior changes.
A Survey of Preference-Based Reinforcement Learning Methods
A unified framework for PbRL is provided that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity.
SAI: a Sensible Artificial Intelligence that plays with handicap and targets high scores in 9x9 Go (extended version)
We develop a new model that can be applied to any perfect information two-player zero-sum game to target a high score, and thus a perfect play. We integrate this model into the Monte Carlo tree…
Accelerating Self-Play Learning in Go
By introducing several improvements to the AlphaZero process and architecture, we greatly accelerate self-play learning in Go, achieving a 50x reduction in computation over comparable methods. Like…
Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping
Conditions under which modifications to the reward function of a Markov decision process preserve the optimal policy are investigated to shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent.
Natural Evolution Strategies
NES is presented, a novel algorithm for performing real-valued "black box" function optimization: optimizing an unknown objective function where algorithm-selected function measurements constitute the only information accessible to the method.
Deep Ordinal Reinforcement Learning
This paper shows how to convert common reinforcement learning algorithms to an ordinal variation by the example of Q-learning and introduces Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal rewards.
Evolution strategy: Optimization of technical systems by means of biological evolution
  • Frommann-Holzboog, Stuttgart 104:15–16.
  • 1973