On the Complexity of the Policy Improvement Algorithm for Markov Decision Processes

  • Mary Melekopoglou, Anne Condon
  • INFORMS J. Comput.
We consider the complexity of the policy improvement algorithm for Markov decision processes. We show that four variants of the algorithm require exponential time in the worst case. INFORMS Journal on Computing, ISSN 1091-9856, was published as ORSA Journal on Computing from 1989 to 1995 under ISSN 0899-1499.

Exponential Lower Bounds for Policy Iteration
This work extends lower bounds to Markov decision processes with the total reward and average-reward optimality criteria to show policy iteration style algorithms have exponential lower bounds in a two player game setting.
Recent Progress on the Complexity of Solving Markov Decision Processes
The model, the two optimality criteria the authors consider (discounted and average rewards), the classical value iteration and policy iteration algorithms, and how to find an optimal policy via linear programming are defined.
Analysis of Lower Bounds for Simple Policy Iteration
A novel exponential lower bound on the number of iterations taken by policy iteration for $N$-state, $k$-action MDPs is proved and an index-based switching rule is given that yields a strong lower bound of $\mathcal{O}\big((3+k)2^{N/2-3}\big)$.
The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate
  • Y. Ye
  • Mathematics
    Math. Oper. Res.
  • 2011
It is proved that the classic policy-iteration method and the original simplex method with the most-negative-reduced-cost pivoting rule of Dantzig are strongly polynomial-time algorithms for solving the Markov decision problem (MDP) with a fixed discount rate.
On policy iteration as a Newton's method and polynomial policy iteration algorithms
This paper improves the upper bounds to a polynomial for policy iteration on MDP problems with special graph structure based on the connection between policy iteration and Newton's method for finding the zero of a convex function.
Complexity Estimates and Reductions to Discounting for Total and Average-Reward Markov Decision Processes and Stochastic Games
This dissertation presents complexity estimates for total and average-reward Markov decision processes and stochastic games, together with reductions to discounting.
Improved Strong Worst-case Upper Bounds for MDP Planning
This paper generalises a contrasting algorithm called the Fibonacci Seesaw and derives a bound of poly(n, k) · k, which is a template to map algorithms for the 2-action setting to the general setting and can also be used to design Policy Iteration algorithms with a running time upper bound of poly(n, k) · k.
A policy-improvement type algorithm for solving zero-sum two-person stochastic games of perfect information
A policy-improvement type algorithm to locate an optimal pure stationary strategy for discounted stochastic games with perfect information and a graph-theoretic motivation for the algorithm are presented.
Computational Models for Decision Making in Dynamic and Uncertain Domains
Three models of control in dynamic systems consisting of a finite set of states where decisions influence state transitions are made, and control objectives and motivations for the models are explained.
Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
  • B. Scherrer
  • Computer Science, Mathematics
    Math. Oper. Res.
  • 2016
Under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, it is shown that Howard's PI terminates after at most $n(m - 1) = O(n^2 m(\tau_t + \tau_r))$ iterations, which generalizes a recent result for deterministic MDPs.


The Complexity of Stochastic Games
On the complexity of local search
The main results are these: Finding a local optimum under the Lin-Kernighan heuristic for the traveling salesman problem is PLS-complete, and a host of simple unweighted local optimality problems are P-complete.
By constructing long 'increasing' paths on appropriate convex polytopes, it is shown that the simplex algorithm for linear programs is not a 'good algorithm' in the sense of J. Edmonds.
Low order polynomial bounds on the expected performance of local improvement algorithms
  • C. Tovey
  • Computer Science
    Math. Program.
  • 1986
We present a general abstract model of local improvement, applicable to such diverse cases as principal pivoting methods for the linear complementarity problem and hill climbing in artificial
Stochastic Games
  • L. Shapley
  • Mathematics
    Proceedings of the National Academy of Sciences
  • 1953
In a stochastic game the play proceeds by steps from position to position, according to transition probabilities controlled jointly by the two players, and the expected total gain or loss is bounded by M, which depends on $N^2 + N$ matrices.
Computational complexity of probabilistic Turing machines
It is shown how probabilistic linear-bounded automata can simulate nondeterministic linear-bounded automata, and an example is given of a function computable more quickly by probabilistic Turing machines than by deterministic Turing machines.
How easy is local search?
Dynamic Programming and Markov Processes