Corpus ID: 9897571

Linear Programming for Large-Scale Markov Decision Problems

@inproceedings{Malek2014LinearPF,
  title={Linear Programming for Large-Scale Markov Decision Problems},
  author={Alan Malek and Yasin Abbasi-Yadkori and P. Bartlett},
  booktitle={ICML},
  year={2014}
}
We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large-scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low…
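For context, the dual LP referred to here has a standard form (notation ours, not copied from the paper): the variable is an occupation measure $\mu$ over state-action pairs, and one solves $\min_{\mu \ge 0} \sum_{s,a} \mu(s,a)\,\ell(s,a)$ subject to the stationarity constraints $\sum_{a'} \mu(s',a') = \sum_{s,a} \mu(s,a)\,P(s' \mid s,a)$ for every state $s'$ and the normalization $\sum_{s,a} \mu(s,a) = 1$. Under standard ergodicity assumptions, a feasible $\mu$ is the long-run state-action frequency of the policy $\pi(a \mid s) \propto \mu(s,a)$, so the objective value is that policy's average cost.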
Optimizing over a Restricted Policy Class in Markov Decision Processes
This work addresses the problem of finding an optimal policy in a Markov decision process under a restricted policy class defined by the convex hull of a set of base policies and shows that there exists an efficient algorithm that finds a policy that is almost as good as the best convex combination of the base policies.
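To make the restricted class concrete, the sketch below (illustrative code in our own notation, not the authors'; the function names and policy representation are assumptions) samples an action from a convex combination of base policies:

import numpy as np

def mixture_action(state, base_policies, weights, rng=None):
    # base_policies: list of callables, each mapping a state to a probability
    #                vector over actions (the base policies pi_1, ..., pi_k)
    # weights: nonnegative mixture coefficients summing to one (a point in the simplex)
    rng = rng or np.random.default_rng()
    probs = sum(w * np.asarray(pi(state), dtype=float)
                for w, pi in zip(weights, base_policies))
    probs = probs / probs.sum()  # guard against rounding drift
    return rng.choice(len(probs), p=probs)

Optimizing over the convex hull then amounts to searching over the weight vector rather than over all policies.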
Stochastic Primal-Dual Method for Learning Mixture Policies in Markov Decision Processes
This work computes the actions of a policy that is nearly as good as a policy chosen by a suitable oracle from a given mixture policy class, characterized by the convex hull of a set of base policies.
Large-Scale Markov Decision Problems with KL Control Cost and its Application to Crowdsourcing
This work shows that for problems with a Kullback-Leibler divergence cost function, policy optimization can be recast as a convex optimization problem and solved approximately using a stochastic subgradient algorithm.
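A minimal sketch of the stochastic subgradient step mentioned above, in our own generic form (not the paper's algorithm): project each noisy subgradient step back onto the feasible convex set and average the iterates.

import numpy as np

def projected_stochastic_subgradient(x0, noisy_subgrad, project, steps=1000, step0=0.1):
    # noisy_subgrad(x): unbiased estimate of a subgradient of the convex objective at x
    # project(x): Euclidean projection onto the feasible convex set
    x = np.asarray(x0, dtype=float)
    avg = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = noisy_subgrad(x)
        x = project(x - (step0 / np.sqrt(t)) * g)  # diminishing step size
        avg += (x - avg) / t                       # running average of iterates
    return avg  # the averaged iterate carries the usual O(1/sqrt(T)) guarantee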
Optimizing over a Restricted Policy Class in MDPs
This work addresses the problem of finding an optimal policy in a Markov decision process (MDP) under a restricted policy class defined by the convex hull of a set of base policies, and proposes an efficient algorithm that finds a policy whose performance is almost as good as that of the best convex combination of the base policies.
On Sample Complexity of Projection-Free Primal-Dual Methods for Learning Mixture Policies in Markov Decision Processes
The primal-dual method achieves better efficiency and lower variance across trials than the penalty function method; a modification of the proposed algorithm with polytope constraint sampling is also given for the smoothed ALP, in which the restriction to lower-bounding approximations is relaxed.
Parameterized MDPs and Reinforcement Learning Problems - A Maximum Entropy Principle Based Framework
The central idea underlying the framework is to quantify exploration in terms of the Shannon Entropy of the trajectories under the MDP and determine the stochastic policy that maximizes it while guaranteeing a low value of the expected cost along a trajectory.
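Schematically (our notation, not necessarily the paper's exact formulation), such a maximum-entropy design solves $\max_\pi H_\pi(\text{trajectory})$ subject to $\mathbb{E}_\pi[\text{cost}] \le c_0$, or its Lagrangian relaxation $\max_\pi \, H_\pi - \beta\, \mathbb{E}_\pi[\text{cost}]$, where the multiplier $\beta$ trades exploration against expected cost.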
On Sample Complexity of Projection-Free Primal-Dual Methods for Learning Mixture Policies in Markov Decision Processes (arXiv preprint, cs.LG, 20 Mar 2019)
We study the problem of learning a policy for an infinite-horizon, discounted-cost Markov decision process (MDP) with a large number of states. We compute the actions of a policy that is nearly as good…
Efficient Performance Bounds for Primal-Dual Reinforcement Learning from Demonstrations
To bridge the gap between theory and practice, a novel bilinear saddle-point framework using Lagrangian duality is introduced and a model-free provably efficient algorithm is developed through the lens of stochastic convex optimization.
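For orientation, the bilinear saddle point arising from Lagrangian duality of a linear program has the generic shape $\min_{x \in X} \max_{y \in Y} \; c^\top x + y^\top (b - Ax)$; primal-dual methods of this kind alternate stochastic updates on the primal variable $x$ and the dual variable $y$ (a generic form, not the paper's exact objective).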
Stochastic convex optimization for provably efficient apprenticeship learning
We consider large-scale Markov decision processes (MDPs) with an unknown cost function and employ stochastic convex optimization tools to address the problem of imitation learning, which consists of…
Large Scale Markov Decision Processes with Changing Rewards
An algorithm is provided that achieves a state-of-the-art regret bound of $\tilde{O}(\sqrt{T})$ for large-scale MDPs with changing rewards, which to the best of the authors' knowledge is the first such result.

References

Showing 1-10 of 42 references
A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees
A bound is established on the performance of the resulting policy that scales gracefully with the number of states without imposing the strong Lyapunov condition required by its counterpart in de Farias and Van Roy.
Approximate Linear Programming for Average Cost MDPs
M. Veatch, Math. Oper. Res., 2013
Bounds are derived for the average cost error and the performance of the policy generated from the LP; these bounds involve the mixing time of the Markov decision process (MDP) under this policy or the optimal policy, improving on a previous performance bound involving mixing times.
Approximate Linear Programming for Average-Cost Dynamic Programming
A two-phase variant of approximate linear programming is proposed that allows for external control, via state-relevance weights, of the relative accuracy of the approximation of the differential cost function over different portions of the state space.
Approximate Dynamic Programming via a Smoothed Linear Program
A novel linear program, called the “smoothed approximate linear program”, is proposed for approximating the dynamic programming cost-to-go function in high-dimensional stochastic control problems; it outperforms the existing LP approach by a substantial margin.
Solving Factored MDPs with Continuous and Discrete Variables
A new linear program approximation method that exploits the structure of the hybrid MDP and lets us compute approximate value functions more efficiently is presented and a new factored discretization of continuous variables that avoids the exponential blow-up of traditional approaches is described.
A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and…
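Schematically, gradient-TD methods of this kind maintain, besides the value weights $\theta$ (with $V(s) \approx \theta^\top \phi(s)$), a second weight vector $w$ that tracks the expected TD update; with TD error $\delta_t = r_{t+1} + \gamma\,\theta^\top \phi_{t+1} - \theta^\top \phi_t$, the updates have the shape $w \leftarrow w + \beta\,(\delta_t \phi_t - w)$ and $\theta \leftarrow \theta + \alpha\,(\phi_t - \gamma \phi_{t+1})(\phi_t^\top w)$, each $O(n)$ per step in the number of features (our paraphrase of the GTD-style update, not quoted from the paper).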
Dynamic Programming and Optimal Control, Vol. II
A major revision of the second volume of a textbook on the far-ranging algorithmic methodology of Dynamic Programming, which can be used for optimal control, Markovian decision problems, planning and…
Toward Off-Policy Learning Control with Function Approximation
The Greedy-GQ algorithm is an extension of recent work on gradient temporal-difference learning to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function.
Linear Program Approximations for Factored Continuous-State Markov Decision Processes
It is argued that this approach offers a robust alternative for solving high-dimensional continuous-state space problems, supported by experiments on three CMDP problems with 24-25 continuous state factors.
Online learning for linearly parametrized control problems
In a discrete-time online control problem, a learner makes an effort to control the state of an initially unknown environment so as to minimize the sum of the losses he suffers, where the losses are…