• Corpus ID: 9028592

A New Softmax Operator for Reinforcement Learning

Kavosh Asadi and Michael L. Littman
A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one’s weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In… 
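For illustration only (this sketch is not from the paper itself): the Boltzmann softmax over a set of values x, with an inverse-temperature parameter beta, is the exp(beta * x)-weighted average of x. It recovers the plain average as beta approaches 0 and approaches the hard maximum as beta grows, which is the "somewhat like max, somewhat like average" behavior described above. A minimal NumPy sketch, with the values in x chosen arbitrarily:

```python
import numpy as np

def boltzmann_softmax(x, beta):
    """Boltzmann softmax: average of x weighted by exp(beta * x)."""
    w = np.exp(beta * (x - np.max(x)))  # subtract max for numerical stability
    return float(np.sum(w * x) / np.sum(w))

x = np.array([1.0, 2.0, 5.0])
print(boltzmann_softmax(x, 0.0))    # equals the plain mean of x
print(boltzmann_softmax(x, 100.0))  # large beta: close to max(x)
```

Note this operator is a weighted average, not the log-sum-exp "softmax" sometimes meant by the same name; the paper's point is that this averaging form can misbehave (for example, it is not a non-expansion), which this sketch does not attempt to demonstrate.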
A unified view of entropy-regularized Markov decision processes
A general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs) is proposed, showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations.
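One common way to see the connection (a generic illustration, not the specific framework of the paper above): entropy regularization replaces the hard max over action values in the Bellman backup with a temperature-scaled log-sum-exp, which tends to the hard max as the temperature tau approaches 0. A hedged NumPy sketch with arbitrary toy values:

```python
import numpy as np

def soft_max_value(q, tau):
    """Entropy-regularized ("soft") maximum: tau * log sum_a exp(q_a / tau)."""
    m = np.max(q)  # shift by the max for numerical stability
    return float(m + tau * np.log(np.sum(np.exp((q - m) / tau))))

q = np.array([1.0, 2.0])        # toy action values (arbitrary)
print(soft_max_value(q, 1.0))   # strictly above max(q): the entropy bonus
print(soft_max_value(q, 0.01))  # low temperature: nearly the hard max
```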
Bridging the Gap Between Value and Policy Based Reinforcement Learning
A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
Improving Policy Gradient by Exploring Under-appreciated Rewards
This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties; the proposed algorithm solves a benchmark multi-digit addition task and generalizes to long sequences, which, to the authors' knowledge, is the first time a pure RL method has solved addition using only reward feedback.
Unifying Value Iteration, Advantage Learning, and Dynamic Policy Programming
A new, robust dynamic programming algorithm that unifies value iteration, advantage learning, and dynamic policy programming is proposed, and it is suggested that AGVI is a promising alternative to previous algorithms.
Smoothed Dual Embedding Control
A new reinforcement learning algorithm, called Smoothed Dual Embedding Control or SDEC, is derived to solve the saddle-point reformulation with arbitrary learnable function approximator and compares favorably to the state-of-the-art baselines on several benchmark control problems.
Exploring Hierarchy-Aware Inverse Reinforcement Learning
A new generative model for human planning under the Bayesian Inverse Reinforcement Learning (BIRL) framework which takes into account the fact that humans often plan using hierarchical strategies, and is able to accurately predict the goals of `Wikispeedia' game players.
Deep Reinforcement Learning: An Overview
This work discusses core RL elements, including the value function (in particular, the Deep Q-Network, DQN), policy, reward, model, planning, and exploration, as well as important RL mechanisms, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn.
Deep Reinforcement Learning
  • Yuxi Li • Reinforcement Learning for Cyber-Physical Systems • 2019
This work surveys deep reinforcement learning, focusing on contemporary work placed in historical context, with background on artificial intelligence, machine learning, deep learning, and reinforcement learning (RL), along with pointers to further resources.
Memristive Fully Convolutional Network: An Accurate Hardware Image-Segmentor in Deep Learning
A complete solution to implement the memristive FCN (MFCN), in which the conductance values of the memristors are predetermined in TensorFlow with an ex-situ training method; the effectiveness of the designed MFCN scheme is verified with improved accuracy over some existing machine learning methods.


Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods
A novel gradient algorithm to learn a policy from an expert's observed behavior assuming that the expert behaves optimally with respect to some unknown reward function of a Markovian Decision Problem is proposed.
Gradient Descent for General Reinforcement Learning
A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms, and allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search algorithm.
Bayesian Inverse Reinforcement Learning
This paper shows how to combine prior knowledge and evidence from the expert's actions to derive a probability distribution over the space of reward functions and presents efficient algorithms that find solutions for the reward learning and apprenticeship learning tasks that generalize well over these distributions.
On-line Q-learning using connectionist systems
Simulations show that on-line learning algorithms are less sensitive to the choice of training parameters than backward replay, and that the alternative update rules of MCQ-L and Q(λ) are more robust than standard Q-learning updates.
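For reference (these are the standard textbook update rules, not specific details from the report above): single-step on-policy and off-policy updates differ only in the bootstrap target. A tabular sketch with assumed toy state/action indices and step sizes:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy (Q-learning) target: bootstrap from the greedy action at s_next."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy (SARSA) target: bootstrap from the action actually taken at s_next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

Q = np.zeros((2, 2))                # 2 states x 2 actions, all values start at 0
q_learning_update(Q, 0, 0, 1.0, 1)  # a reward of 1.0 moves Q[0, 0] toward the target
print(Q[0, 0])                      # 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```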
Bayesian Q-Learning
This paper extends Watkins' Q-learning by maintaining and propagating probability distributions over the Q-values, establishes the algorithm's convergence properties, and shows that it can exhibit substantial improvements over other well-known model-free exploration strategies.
Reinforcement Learning: An Introduction
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
A Generalized Reinforcement-Learning Model: Convergence and Applications
This paper shows how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games, and MDPs under a worst-case optimality criterion as special cases.
When the Best Move Isn't Optimal: Q-learning with Exploration
Q-learning produces an optimal policy, and its value estimates converge to the correct values under that policy; once the estimates are correct, an agent can use them to select the action with maximal expected future reward in each state and thus perform optimally.
Relative Entropy Policy Search
The Relative Entropy Policy Search (REPS) method is suggested, which differs significantly from previous policy gradient approaches and yields an exact update step and works well on typical reinforcement learning benchmark problems.
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
This paper examines the convergence of single-step on-policy RL algorithms for control with both decaying exploration and persistent exploration and provides examples of exploration strategies that result in convergence to both optimal values and optimal policies.