Corpus ID: 245650754

Operator Deep Q-Learning: Zero-Shot Reward Transferring in Reinforcement Learning

Ziyang Tang, Yihao Feng, Qiang Liu
Reinforcement learning (RL) has drawn increasing interest in recent years due to its tremendous success in various applications. However, standard RL algorithms can only be applied to a single reward function and cannot adapt quickly to an unseen reward function. In this paper, we advocate a general operator view of reinforcement learning, which enables us to directly approximate the operator that maps from reward function to value function. The benefit of learning the operator is that we can… 
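In the tabular case, the reward-to-value map the abstract describes is a concrete linear operator: under a fixed policy it is the resolvent (I − γPᵖ)⁻¹, which sends any reward vector to its value vector. The sketch below illustrates this with a hypothetical 3-state MDP (the transition matrix and reward vectors are made-up numbers, not from the paper); once the operator is known, a new reward is handled zero-shot.

```python
import numpy as np

# Hypothetical 3-state MDP: P_pi is the transition matrix under a fixed
# policy, gamma the discount factor. All numbers are illustrative.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])

# Tabular reward-to-value operator: V = (I - gamma * P_pi)^{-1} r.
operator = np.linalg.inv(np.eye(3) - gamma * P_pi)

r_old = np.array([1.0, 0.0, 0.0])
r_new = np.array([0.0, 0.0, 1.0])   # an unseen reward function

V_old = operator @ r_old
V_new = operator @ r_new            # zero-shot: no re-solving required
```

The paper's contribution is, roughly, to approximate this operator with a deep network so that the same reward-in, value-out evaluation works beyond the tabular setting.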

Off-Policy Deep Reinforcement Learning without Exploration

This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
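The batch constraint described above can be sketched in the discrete-action case: only actions whose relative likelihood under a model of the batch data clears a threshold are eligible for greedy selection. The critic values, behavior probabilities, and threshold below are illustrative stand-ins, not the paper's actual networks.

```python
import numpy as np

# Hypothetical discrete setup: q_values from a learned critic,
# behavior_probs from a generative model fit to the batch data.
q_values = np.array([2.0, 5.0, 1.0, 4.0])
behavior_probs = np.array([0.40, 0.02, 0.38, 0.20])

# Batch-constrained selection: actions are eligible only if their
# likelihood, relative to the most likely action in the batch model,
# exceeds tau. This keeps the agent close to the behavior policy.
tau = 0.3
eligible = behavior_probs / behavior_probs.max() >= tau
masked_q = np.where(eligible, q_values, -np.inf)
action = int(np.argmax(masked_q))
```

Here action 1 has the highest Q-value but is poorly supported by the batch, so the constraint rules it out; this is exactly the restriction of the action space the summary refers to.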

Fast reinforcement learning with generalized policy updates

It is argued that complex decision problems can be naturally decomposed into multiple tasks that unfold in sequence or in parallel, and that associating each task with a reward function can be seamlessly accommodated within the standard reinforcement-learning formalism.

On Reward-Free Reinforcement Learning with Linear Function Approximation

An algorithm for reward-free RL in the linear Markov decision process setting where both the transition and the reward admit linear representations is given, and the sample complexity is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions.

Transfer in Deep Reinforcement Learning Using Successor Features and Generalised Policy Improvement

This paper shows that the transfer promoted by SFs & GPI leads to very good policies on unseen tasks almost instantaneously, and describes how to learn policies specialised to the new tasks in a way that allows them to be added to the agent's set of skills, and thus be reused in the future.
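The near-instantaneous transfer via SFs & GPI can be sketched as follows: if rewards factor as r(s, a) = φ(s, a)·w, then each stored policy's successor features ψ give its Q-values on a new task with weights w_new for free, and generalised policy improvement acts greedily with respect to the maximum over stored policies. The arrays below are made-up illustrative numbers.

```python
import numpy as np

# Hypothetical: psi[i] holds the successor features of stored policy i,
# shaped (num_actions, feature_dim). All values are illustrative.
psi = np.array([[[1.0, 0.0], [0.2, 0.8]],    # policy 0
                [[0.1, 0.9], [0.7, 0.3]]])   # policy 1

# Reward weights of an unseen task: r(s, a) = phi(s, a) . w_new.
w_new = np.array([0.0, 1.0])

# Each stored policy's Q-values on the new task, with no extra learning:
q = psi @ w_new                  # shape (num_policies, num_actions)

# Generalised policy improvement: greedy w.r.t. the max over policies.
action = int(np.argmax(q.max(axis=0)))
```

This is the mechanism behind "very good policies on unseen tasks almost instantaneously": only the dot product with the new task's weights changes.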

Soft Actor-Critic Algorithms and Applications

Soft Actor-Critic (SAC), the recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework, achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample efficiency and asymptotic performance.

Deep Successor Reinforcement Learning

DSR is presented, which generalizes Successor Representations within an end-to-end deep reinforcement learning framework and has several appealing properties including: increased sensitivity to distal reward changes due to factorization of reward and world dynamics, and the ability to extract bottleneck states given successor maps trained under a random policy.

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

This paper develops an off-policy meta-RL algorithm that disentangles task inference and control and performs online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience.

Task-agnostic Exploration in Reinforcement Learning

An efficient task-agnostic RL algorithm that finds near-optimal policies for N arbitrary tasks after at most Õ(log(N) H^5 S A / ε^2) exploration episodes, and provides an N-independent sample complexity bound for UCBZero in the statistically easier setting where the ground-truth reward functions are known.

Minimax Weight and Q-Function Learning for Off-Policy Evaluation

A new estimator, MWL, is introduced that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work.

RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

This paper proposes to represent a "fast" reinforcement learning algorithm as a recurrent neural network (RNN) and learn it from data; the fast algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm.