• Corpus ID: 2681235

Reinforcement Learning under Model Mismatch

Aurko Roy, Huan Xu, Sebastian Pokutta
We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation to it. We address this problem by extending the framework of robust MDPs to the model-free Reinforcement Learning setting, where we do not have access to the model parameters, but can only sample states from it. We define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy… 
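To make the idea of a robust Q-learning update concrete, here is a minimal tabular sketch. It is not the authors' algorithm: the abstract does not specify their uncertainty set, so the sketch assumes a simple contamination-style set in which an adversary redirects a fraction `delta` of the transition mass to the lowest-value state. The function name `robust_q_learning`, the parameters `delta`, `alpha`, and `explore`, and the two-state example are all hypothetical choices for illustration.

```python
import numpy as np

def robust_q_learning(P, R, gamma=0.9, delta=0.1, alpha=0.1,
                      explore=0.2, steps=20000, seed=0):
    """Tabular Q-learning with a pessimistic (robust) target.

    P: nominal transitions, shape (S, A, S); used only to sample next states.
    R: rewards, shape (S, A).
    delta: assumed contamination level of the uncertainty set (illustrative).
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    s = 0
    for _ in range(steps):
        # Epsilon-greedy behaviour policy.
        if rng.random() < explore:
            a = int(rng.integers(A))
        else:
            a = int(np.argmax(Q[s]))
        # Model-free: we only observe a sampled next state.
        s_next = int(rng.choice(S, p=P[s, a]))
        v = Q.max(axis=1)
        # Worst case over the assumed set: with weight delta the
        # adversary sends the agent to the lowest-value state.
        robust_v = (1.0 - delta) * v[s_next] + delta * v.min()
        Q[s, a] += alpha * (R[s, a] + gamma * robust_v - Q[s, a])
        s = s_next
    return Q

# Hypothetical two-state, two-action nominal model.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.9, 0.1], [0.1, 0.9]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
Q_robust = robust_q_learning(P, R)
```

Setting `delta=0` recovers the ordinary Q-learning target, which is one way to see how much value the pessimistic term removes.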

Model-Free Robust Reinforcement Learning with Linear Function Approximation

This paper proposes Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation, and proves the convergence of this algorithm using stochastic approximation techniques.

Online Robust Reinforcement Learning with Model Uncertainty

This paper develops a sample-based approach to estimate the unknown uncertainty set, and designs a robust Q-learning algorithm and robust TDC algorithm, which can be implemented in an online and incremental fashion and proves the robustness of the algorithms.

Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees

This paper proposes Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation, and proves the convergence of this algorithm using stochastic approximation techniques.

Sample Complexity of Model-Based Robust Reinforcement Learning

This paper proposes a model-based robust reinforcement learning algorithm that learns an ε-optimal robust value function and policy in a finite state and action space setting when the nominal simulator model is not known exactly.

Policy Gradient Method For Robust Reinforcement Learning

This paper develops the first policy gradient method with global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch and designs the robust actor-critic method with differentiable parametric policy class and value function.

Sample Complexity of Robust Reinforcement Learning with a Generative Model

This work proposes a model-based reinforcement learning (RL) algorithm for learning an ε-optimal robust policy when the nominal model is unknown, and considers three forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence.

Robust Constrained Reinforcement Learning

This work designs a robust primal-dual approach and develops theoretical guarantees on its convergence, complexity, and robust feasibility. It then investigates a concrete example, the δ-contamination uncertainty set, for which it designs an online, model-free algorithm and characterizes its sample complexity.

Data-Driven Robust Multi-Agent Reinforcement Learning

This paper develops a robust multi-agent Q-learning algorithm, which is model-free and fully decentralized, and offers provable robustness under model uncertainty without incurring additional computational and memory cost.

Robust Reinforcement Learning using Offline Data

This paper proposes a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. It proves that RFQI learns a near-optimal robust policy under standard assumptions and demonstrates its superior performance on standard benchmark problems.

A Bayesian Approach to Robust Reinforcement Learning

This study introduces the Uncertainty Robust Bellman Equation (URBE) which encourages safe exploration for adapting the uncertainty set to new observations while preserving robustness and proposes a URBE-based algorithm, DQN-URBE, that scales this method to higher dimensional domains.



Scaling Up Robust MDPs using Function Approximation

This work develops a robust approximate dynamic programming method based on a projected fixed point equation to approximately solve large scale robust MDPs, shows that the proposed method provably succeeds under certain technical conditions, and demonstrates its effectiveness through simulation of an option pricing problem.

Regularized Policy Iteration

This paper proposes two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD).

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

This paper derives a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept, the approximation power of the function set, and the controllability properties of the MDP.

Robust Adversarial Reinforcement Learning

RARL is proposed, where an agent is trained to operate in the presence of a destabilizing adversary that applies disturbance forces to the system and the jointly trained adversary is reinforced - that is, it learns an optimal destabilization policy.

Robust Reinforcement Learning

This paper proposes a new reinforcement learning paradigm, called robust reinforcement learning (RRL), that explicitly takes into account input disturbance as well as modeling errors, and tests it on the control task of an inverted pendulum.

A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L_2 norm. Its expected update is proved to be in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without its quadratic computational complexity.

Least Squares Policy Evaluation Algorithms with Linear Function Approximation

A new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, based on the λ-policy iteration method of Bertsekas and Ioffe, is proposed, and the convergence of LSTD(λ), with probability 1, is shown for every λ ∈ [0, 1].

Solving Uncertain Markov Decision Processes

The authors demonstrate that the uncertain model approach can be used to solve a class of nearly Markovian Decision Problems, providing lower bounds on performance in stochastic models with higher-order interactions.

Robust Dynamic Programming

G. Iyengar, Math. Oper. Res., 2005
It is proved that when this set of measures has a certain "rectangularity" property, all of the main results for finite and infinite horizon DP extend to natural robust counterparts.

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

This work presents a Bellman error objective function and two gradient-descent TD algorithms that optimize it, and proves the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution.