Corpus ID: 211171753

Optimistic Policy Optimization with Bandit Feedback

@article{Efroni2020OptimisticPO,
  title={Optimistic Policy Optimization with Bandit Feedback},
  author={Yonathan Efroni and Lior Shani and Aviv Rosenberg and Shie Mannor},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.08243}
}
Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust… 
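
To make the setting concrete, here is a minimal, hypothetical sketch of the two ingredients the abstract alludes to (optimism via exploration bonuses, and a trust-region-style softmax policy update) for a tabular finite-horizon MDP. It is not the authors' algorithm; the function name, the bonus form, and the constants eta and c are placeholders.

import numpy as np

# Hypothetical sketch of one round of optimistic policy optimization in a
# tabular finite-horizon MDP with H steps, S states, A actions. Illustrative
# only; not the paper's exact update.
def optimistic_policy_update(pi, Q_hat, counts, eta=0.1, c=1.0):
    """pi:     current policy, shape (H, S, A), rows sum to 1
       Q_hat:  empirical Q-value estimates, shape (H, S, A)
       counts: visit counts n(h, s, a), shape (H, S, A)
       eta:    mirror-descent step size (placeholder value)
       c:      bonus scale (placeholder value)"""
    bonus = c / np.sqrt(np.maximum(counts, 1))        # optimism: larger where data is scarce
    Q_opt = Q_hat + bonus                             # optimistic Q estimate
    logits = np.log(pi + 1e-12) + eta * Q_opt         # exponentiated-gradient (mirror-descent) step
    new_pi = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return new_pi / new_pi.sum(axis=-1, keepdims=True)

The multiplicative softmax step keeps each new policy close in KL divergence to the previous one (the "trust region" flavor), while the count-based bonus implements optimism in the face of uncertainty.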

Papers citing this work

Policy Optimization as Online Learning with Mediator Feedback
TLDR
The notion of mediator feedback, which frames PO as an online learning problem over the policy space, is introduced; problem-dependent regret lower bounds are derived, and RANDOMIST is extended to compact policy spaces.
Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss
TLDR
A new high-probability drift analysis of Lagrange multiplier processes is incorporated into the celebrated regret analysis of upper confidence reinforcement learning, which demonstrates the power of "optimism in the face of uncertainty" in constrained online learning.
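
A generic way to picture the upper-confidence primal-dual idea, kept deliberately schematic (the step size $\eta$, threshold $\tau$, and optimistic estimates below are illustrative, not the paper's exact construction): alternate a projected dual-ascent step on the Lagrange multiplier with an optimistic primal step,

\[
\lambda_{k+1} = \big[\lambda_k + \eta\,(\widehat{C}^{\pi_k} - \tau)\big]_{+},
\qquad
\pi_{k+1} \in \arg\max_{\pi}\ \widehat{V}^{\pi}_{\mathrm{opt}} - \lambda_{k+1}\,\widehat{C}^{\pi}_{\mathrm{opt}},
\]

where $\widehat{V}^{\pi}_{\mathrm{opt}}$ and $\widehat{C}^{\pi}_{\mathrm{opt}}$ are upper-confidence estimates of the reward value and the constraint cost of policy $\pi$. The drift analysis mentioned in the summary controls how far the multiplier $\lambda_k$ can wander during learning.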
Provably Correct Optimization and Exploration with Non-linear Policies
TLDR
ENIAC, an actor-critic method that allows non-linear function approximation in the critic, is designed; it outperforms prior heuristics inspired by linear methods, establishing the value of correctly reasoning about the agent's uncertainty under non-linear function approximation.
Dynamic Regret of Policy Optimization in Non-stationary Environments
TLDR
This work proposes two model-free policy optimization algorithms, POWER and POWER++, and establishes guarantees for their dynamic regret, and shows that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction.
Optimization Issues in KL-Constrained Approximate Policy Iteration
TLDR
This work compares the use of KL divergence as a constraint vs. as a regularizer, and point out several optimization issues with the widely-used constrained approach, and shows that regularization can improve the optimization landscape of the original objective.
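
For reference, the two formulations being contrasted can be written, in standard TRPO-style notation (trust-region radius $\delta$, regularization weight $\lambda$; the paper's exact objectives may differ in details such as which direction of the KL is used):

\[
\text{constrained:}\quad \max_{\pi}\ \mathbb{E}_{s,a\sim \pi_k}\!\Big[\tfrac{\pi(a\mid s)}{\pi_k(a\mid s)}\,A^{\pi_k}(s,a)\Big]
\ \ \text{s.t.}\ \ \mathbb{E}_{s}\big[\mathrm{KL}\big(\pi_k(\cdot\mid s)\,\big\|\,\pi(\cdot\mid s)\big)\big] \le \delta,
\]
\[
\text{regularized:}\quad \max_{\pi}\ \mathbb{E}_{s,a\sim \pi_k}\!\Big[\tfrac{\pi(a\mid s)}{\pi_k(a\mid s)}\,A^{\pi_k}(s,a)\Big]
- \lambda\,\mathbb{E}_{s}\big[\mathrm{KL}\big(\pi_k(\cdot\mid s)\,\big\|\,\pi(\cdot\mid s)\big)\big].
\]

The summary's point is that the soft (regularized) version tends to have a better-behaved optimization landscape than enforcing the hard constraint.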
PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
TLDR
This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover), and complements the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
Exploration-Exploitation in Constrained MDPs
TLDR
This work analyzes two approaches for learning in Constrained Markov Decision Processes and highlights a crucial difference between them: the linear programming approach yields stronger guarantees than the dual-formulation-based approach.
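
To make the distinction concrete, the linear-programming view optimizes directly over occupancy measures $q(s,a)$; schematically (flow constraints written here for the stationary case, with reward $r$, cost $c$, and budget $\tau$; the finite-sample version adds confidence sets around the unknown transitions):

\[
\max_{q \ge 0}\ \sum_{s,a} q(s,a)\,r(s,a)
\quad \text{s.t.} \quad
\sum_{a} q(s',a) = \sum_{s,a} P(s'\mid s,a)\,q(s,a)\ \ \forall s',
\qquad
\sum_{s,a} q(s,a)\,c(s,a) \le \tau.
\]

The dual-formulation approach instead moves the cost constraint into the objective via a Lagrange multiplier and optimizes policy and multiplier jointly, which is where the weaker guarantees mentioned above arise.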
Minimax Regret for Stochastic Shortest Path
TLDR
An algorithm is provided for the finite-horizon setting whose leading regret term depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon; the algorithm is based on a novel reduction from SSP to finite-horizon MDPs.
Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition
TLDR
These results significantly improve upon the existing work of Rosenberg and Mansour (2020), which only considers the full-information setting and achieves suboptimal regret, and they are also the first to consider bandit feedback with adversarial costs.
Learning Adversarial Markov Decision Processes with Delayed Feedback
TLDR
This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback, and is the first to consider regret minimization in the important setting of MDPs with delayed feedback.

References

Showing 1-10 of 50 references
Provably Efficient Exploration in Policy Optimization
TLDR
This paper proves that, in the problem of episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves a $\sqrt{T}$-type regret bound.
Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies
TLDR
It is established that exploring with greedy policies -- acting by 1-step planning -- can achieve tight minimax performance in terms of regret; consequently, full planning in model-based RL can be avoided altogether without any performance degradation, while the computational complexity decreases.
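
As a rough illustration of what acting by 1-step planning means (not the paper's algorithm; the function name, bonus form, and constant c are placeholders), one can maintain a value table and choose actions by an optimistic one-step lookahead at the current state only:

import numpy as np

# Hypothetical sketch of "1-step planning": instead of solving the full planning
# problem each episode, look one step ahead from the current state with an
# optimistic bonus and act greedily.
def one_step_greedy_action(s, h, V, P_hat, r_hat, counts, c=1.0):
    """s: current state index, h: current step, V: value table, shape (H+1, S),
       P_hat: empirical transitions, shape (S, A, S), r_hat: empirical rewards, shape (S, A),
       counts: visit counts, shape (S, A), c: placeholder bonus scale."""
    bonus = c / np.sqrt(np.maximum(counts[s], 1))      # optimism per action
    q = r_hat[s] + bonus + P_hat[s] @ V[h + 1]         # one-step lookahead values
    return int(np.argmax(q)), float(q.max())           # greedy action and backup value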
Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs
TLDR
This work shows that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis, and proves fast rates of Õ(1/N), much like results in convex optimization.
Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes
TLDR
This result significantly improves over the $\mathcal{O}(T^{3/4})$ regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting.
Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
TLDR
An algorithm for finite-horizon discrete MDPs, with an associated analysis that both yields state-of-the-art worst-case regret bounds in the dominant terms and gives substantially tighter bounds when the RL environment has a small environmental norm, a quantity determined by the variance of the next-state value functions.
POLITEX: Regret Bounds for Policy Iteration using Expert Prediction
TLDR
POLicy ITeration with EXpert advice (POLITEX) is presented, a variant of policy iteration in which each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies; the viability of POLITEX beyond linear function approximation is also confirmed.
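
Concretely, this update amounts to running exponential weights (Hedge) in every state with the accumulated action-value estimates as payoffs; up to the details of how each $\widehat{Q}^{\pi_j}$ is estimated, the policy at phase $t$ is

\[
\pi_t(a \mid s) \ \propto\ \exp\!\Big(\eta \sum_{j=1}^{t-1} \widehat{Q}^{\pi_j}(s,a)\Big),
\]

where $\eta$ is a learning rate.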
Minimax Regret Bounds for Reinforcement Learning
We consider the problem of provably optimal exploration in reinforcement learning for finite-horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound that matches the minimax lower bound up to lower-order terms.
Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
TLDR
The algorithm is the first to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting, and it achieves the same regret bound as Rosenberg & Mansour (2019a), who consider an easier setting with full-information feedback.
On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift
TLDR
This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs), and shows an important interplay between estimation error, approximation error, and exploration.
Near-optimal Regret Bounds for Reinforcement Learning
TLDR
This work presents a reinforcement learning algorithm with total regret $\tilde{O}(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D, and proposes a new parameter: an MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps on average.
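
The diameter referred to here is the standard one: writing $T(s' \mid \pi, s)$ for the number of steps policy $\pi$ needs to reach $s'$ from $s$,

\[
D \ :=\ \max_{s \neq s'}\ \min_{\pi}\ \mathbb{E}\big[T(s' \mid \pi, s)\big],
\]

i.e., between any two states there is some policy whose expected travel time is at most $D$.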