# Linear Programming for Large-Scale Markov Decision Problems

@inproceedings{Malek2014LinearPF, title={Linear Programming for Large-Scale Markov Decision Problems}, author={Alan Malek and Yasin Abbasi-Yadkori and P. Bartlett}, booktitle={ICML}, year={2014} }

We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low…
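The dual variable mentioned in the abstract can be made concrete on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the two-state, two-action MDP, its costs, and the policy are made-up numbers, and the helper names are hypothetical. It builds the stationary state-action distribution induced by a fixed policy, checks the flow-conservation constraint that every feasible point of the dual LP satisfies, and evaluates the average cost as the inner product of that distribution with the costs.

```python
# Toy MDP (all numbers are illustrative assumptions, not from the paper).
# P[x][a][y] = probability of moving from state x to state y under action a.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
# c[x][a] = cost of taking action a in state x.
c = [[1.0, 2.0], [0.5, 3.0]]
# pi[x][a] = probability that the (hypothetical) policy picks a in state x.
pi = [[0.5, 0.5], [1.0, 0.0]]


def policy_transition_matrix(P, pi):
    """State-to-state transition matrix obtained by averaging over actions."""
    n = len(P)
    return [
        [sum(pi[x][a] * P[x][a][y] for a in range(len(pi[x]))) for y in range(n)]
        for x in range(n)
    ]


def stationary_state_distribution(P_pi, iters=200):
    """Power iteration; assumes the chain under the policy is ergodic."""
    n = len(P_pi)
    nu = [1.0 / n] * n
    for _ in range(iters):
        nu = [sum(nu[x] * P_pi[x][y] for x in range(n)) for y in range(n)]
    return nu


def flow_residual(mu, P):
    """Largest violation of: sum_a mu[y][a] == sum_{x,a} mu[x][a] * P[x][a][y]."""
    n = len(P)
    worst = 0.0
    for y in range(n):
        outflow = sum(mu[y])
        inflow = sum(
            mu[x][a] * P[x][a][y] for x in range(n) for a in range(len(mu[x]))
        )
        worst = max(worst, abs(outflow - inflow))
    return worst


nu = stationary_state_distribution(policy_transition_matrix(P, pi))
# Stationary distribution over state-action pairs: mu[x][a] = nu[x] * pi[x][a].
mu = [[nu[x] * pi[x][a] for a in range(len(pi[x]))] for x in range(len(pi))]
# Average cost of the policy is linear in mu.
avg_cost = sum(mu[x][a] * c[x][a] for x in range(len(c)) for a in range(len(c[x])))

print("mu =", mu)
print("flow residual =", flow_residual(mu, P))
print("average cost =", avg_cost)
```

Because the average cost is linear in the state-action distribution and the constraints (non-negativity, normalization, flow conservation) are linear as well, minimizing over all such distributions is a linear program. The sketch only evaluates one feasible point of that program rather than solving it.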

#### 35 Citations

Optimizing over a Restricted Policy Class in Markov Decision Processes

- Computer Science, Mathematics
- ArXiv
- 2018

This work addresses the problem of finding an optimal policy in a Markov decision process under a restricted policy class defined by the convex hull of a set of base policies and shows that there exists an efficient algorithm that finds a policy that is almost as good as the best convex combination of the base policies.

Stochastic Primal-Dual Method for Learning Mixture Policies in Markov Decision Processes

- Computer Science
- 2019 IEEE 58th Conference on Decision and Control (CDC)
- 2019

This work computes the actions of a policy that is nearly as good as a policy chosen by a suitable oracle from a given mixture policy class characterized by the convex hull of a set of base policies.

Large-Scale Markov Decision Problems with KL Control Cost and its Application to Crowdsourcing

- Mathematics, Computer Science
- ICML
- 2015

This work shows that for problems with a Kullback-Leibler divergence cost function, policy optimization can be recast as a convex optimization and solved approximately using a stochastic subgradient algorithm.

Optimizing over a Restricted Policy Class in MDPs

- Computer Science
- AISTATS
- 2019

This work addresses the problem of finding an optimal policy in a Markov decision process (MDP) under a restricted policy class defined by the convex hull of a set of base policies, and proposes an efficient algorithm that finds a policy whose performance is almost as good as that of the best convex combination of the base policies.

On Sample Complexity of Projection-Free Primal-Dual Methods for Learning Mixture Policies in Markov Decision Processes

- Mathematics, Computer Science
- ArXiv
- 2019

The primal-dual method achieves better efficiency and lower variance across trials than the penalty function method. The work also proposes a modification of the algorithm with polytope constraint sampling for the smoothed ALP, in which the restriction to lower-bounding approximations is relaxed.

Parameterized MDPs and Reinforcement Learning Problems - A Maximum Entropy Principle Based Framework

- Medicine, Computer Science
- IEEE transactions on cybernetics
- 2021

The central idea underlying the framework is to quantify exploration in terms of the Shannon Entropy of the trajectories under the MDP and determine the stochastic policy that maximizes it while guaranteeing a low value of the expected cost along a trajectory.

On Sample Complexity of Projection-Free Primal-Dual Methods for Learning Mixture Policies in Markov Decision Processes

- 2019

We study the problem of learning policy of an infinite-horizon, discounted cost, Markov decision process (MDP) with a large number of states. We compute the actions of a policy that is nearly as good…

Efficient Performance Bounds for Primal-Dual Reinforcement Learning from Demonstrations

- Computer Science
- ICML
- 2021

To bridge the gap between theory and practice, a novel bilinear saddle-point framework using Lagrangian duality is introduced and a model-free provably efficient algorithm is developed through the lens of stochastic convex optimization.

Stochastic convex optimization for provably efficient apprenticeship learning

- 2019

We consider large-scale Markov decision processes (MDPs) with an unknown cost function and employ stochastic convex optimization tools to address the problem of imitation learning, which consists of…

Large Scale Markov Decision Processes with Changing Rewards

- Computer Science, Mathematics
- NeurIPS
- 2019

An algorithm is provided that achieves a state-of-the-art regret bound of $\tilde{O}(\sqrt{T})$ for large-scale MDPs with changing rewards, which, to the best of the authors' knowledge, is the first such bound.

#### References

SHOWING 1-10 OF 42 REFERENCES

A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees

- Mathematics, Computer Science
- Math. Oper. Res.
- 2006

A bound is established on the performance of the resulting policy that scales gracefully with the number of states without imposing the strong Lyapunov condition required by its counterpart in de Farias and Van Roy.

Approximate Linear Programming for Average Cost MDPs

- Mathematics, Computer Science
- Math. Oper. Res.
- 2013

Bounds are derived for average cost error and performance of the policy generated from the LP that involve the mixing time of the Markov decision process (MDP) under this policy or the optimal policy, improving on a previous performance bound involving mixing times.

Approximate Linear Programming for Average-Cost Dynamic Programming

- Computer Science, Mathematics
- NIPS
- 2002

A two-phase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights is proposed.

Approximate Dynamic Programming via a Smoothed Linear Program

- Computer Science, Mathematics
- Oper. Res.
- 2012

A novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems, called the “smoothed approximate linear program”, which outperforms the existing LP approach by a substantial margin.

Solving Factored MDPs with Continuous and Discrete Variables

- Computer Science, Mathematics
- UAI
- 2004

A new linear program approximation method that exploits the structure of the hybrid MDP and lets us compute approximate value functions more efficiently is presented and a new factored discretization of continuous variables that avoids the exponential blow-up of traditional approaches is described.

A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation

- Mathematics
- NIPS 2008
- 2008

We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and…

Dynamic Programming and Optimal Control, Vol. II

- Computer Science
- 1976

A major revision of the second volume of a textbook on the far-ranging algorithmic methodology of Dynamic Programming, which can be used for optimal control, Markovian decision problems, planning and…

Toward Off-Policy Learning Control with Function Approximation

- Mathematics, Computer Science
- ICML
- 2010

The Greedy-GQ algorithm is an extension of recent work on gradient temporal-difference learning to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function.

Linear Program Approximations for Factored Continuous-State Markov Decision Processes

- Computer Science, Mathematics
- NIPS
- 2003

It is argued that this approach offers a robust alternative for solving high dimensional continuous-state space problems and is supported by experiments on three CMDP problems with 24-25 continuous state factors.

Online learning for linearly parametrized control problems

- Mathematics
- 2012

In a discrete-time online control problem, a learner makes an effort to control the state of an initially unknown environment so as to minimize the sum of the losses he suffers, where the losses are…