# The Simplex Method is Strongly Polynomial for Deterministic Markov Decision Processes

@article{Post2012TheSM, title={The Simplex Method is Strongly Polynomial for Deterministic Markov Decision Processes}, author={Ian Post and Yinyu Ye}, journal={Mathematics of Operations Research}, year={2012} }

We prove that the simplex method with the highest-gain/most-negative-reduced-cost pivoting rule converges in strongly polynomial time for deterministic Markov decision processes (MDPs) regardless of the discount factor. For a deterministic MDP with n states and m actions, we prove the simplex method runs in O(n³m² log² n) iterations if the discount factor is uniform and O(n⁵m³ log² n) iterations if each action has a distinct discount factor. Previously the simplex method was known to run in…
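As a hedged illustration of the pivoting rule the paper analyzes, the sketch below runs single-switch policy iteration on a toy deterministic discounted MDP: switching one state-action pair per round corresponds to one simplex pivot on the MDP's linear program, and choosing the switch with the largest gain is the most-negative-reduced-cost rule. The toy instance, function names, and the fixed-point policy evaluation are illustrative assumptions, not taken from the paper.

```python
# Sketch (assumed names/instance): simplex-style single-switch policy
# iteration with the highest-gain rule on a deterministic discounted MDP.

def evaluate(policy, R, T, gamma, iters=1000):
    """Approximate v = r_pi + gamma * v(T_pi) by fixed-point iteration."""
    n = len(policy)
    v = [0.0] * n
    for _ in range(iters):
        v = [R[s][policy[s]] + gamma * v[T[s][policy[s]]] for s in range(n)]
    return v

def simplex_mdp(R, T, gamma):
    """R[s][a]: reward of action a in state s; T[s][a]: its successor state."""
    n = len(R)
    policy = [0] * n                        # arbitrary initial policy
    while True:
        v = evaluate(policy, R, T, gamma)
        # gain of switching state s to action a (negative of the reduced cost)
        best, best_gain = None, 1e-9        # small tolerance for float noise
        for s in range(n):
            for a in range(len(R[s])):
                gain = R[s][a] + gamma * v[T[s][a]] - v[s]
                if gain > best_gain:
                    best, best_gain = (s, a), gain
        if best is None:                    # no improving action: optimal
            return policy, v
        s, a = best
        policy[s] = a                       # single switch = one simplex pivot

# Toy MDP: state 1 self-loops with reward 2, state 2 with reward 1;
# state 0 chooses between them.
R = [[0, 1], [2], [1]]
T = [[1, 2], [1], [2]]
policy, v = simplex_mdp(R, T, 0.9)         # optimal: state 0 moves to state 1
```

With discount 0.9 the self-loop values are 2/(1 − 0.9) = 20 and 1/(1 − 0.9) = 10, so the method keeps state 0's edge into state 1.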

## 42 Citations

### The Complexity of the Simplex Method

- Computer Science, Mathematics · STOC
- 2015

This paper uses the known connection between Markov decision processes (MDPs) and linear programming, and an equivalence between Dantzig's pivot rule and a natural variant of policy iteration for average-reward MDPs, to prove that it is PSPACE-complete to find the solution computed by the simplex method using Dantzig's pivot rule.

### Polynomial Time Algorithms for Branching Markov Decision Processes and Probabilistic Min(Max) Polynomial Bellman Equations

- Computer Science, Mathematics
- 2018

The first polynomial time algorithm for computing, to any desired precision, optimal (maximum and minimum) extinction probabilities for BMDPs is obtained, based on a novel generalization of Newton’s method.

### Geometric Policy Iteration for Markov Decision Processes

- Computer Science · KDD
- 2022

This work proposes a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs and proves that the complexity of GPI achieves the best known bound O(|𝒜|/(1 − γ) · log(1/(1 − γ))) of policy iteration.

### Improved Strong Worst-case Upper Bounds for MDP Planning

- Computer Science · IJCAI
- 2017

This paper generalises a contrasting algorithm called the Fibonacci Seesaw and derives a bound of poly(n, k) · k^(0.7207n); the technique is a template for mapping algorithms for the 2-action setting to the general setting, and can also be used to design policy iteration algorithms with a running time upper bound of poly(n, k) · k^(0.7207n).

### Improved Strongly Polynomial Algorithms for Deterministic MDPs, 2VPI Feasibility, and Discounted All-Pairs Shortest Paths

- Mathematics, Computer Science · SODA
- 2022

A randomized trade-off algorithm is given for finding optimal strategies for deterministic Markov decision processes (DMDPs) and for the closely related problem of testing feasibility of systems of m linear inequalities on n real variables with at most two variables per inequality (2VPI).

### Recent Progress on the Complexity of Solving Markov Decision Processes

- Mathematics, Computer Science
- 2014

This survey defines the model, the two optimality criteria the authors consider (discounted and average rewards), the classical value iteration and policy iteration algorithms, and how to find an optimal policy via linear programming.
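For context, the textbook primal LP whose optimal solution is the discounted MDP value function (a standard formulation, not specific to this survey) can be written as:

```latex
\min_{v} \; \sum_{s} v(s)
\quad \text{s.t.} \quad
v(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s')
\quad \forall s, a
```

An optimal policy reads off, for each state, an action whose constraint is tight at the optimum; simplex pivots on the dual of this LP correspond to single-action switches in policy iteration.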

### Partial Policy Iteration for L1-Robust Markov Decision Processes

- Computer Science · J. Mach. Learn. Res.
- 2021

This paper proposes partial policy iteration, a new, efficient, flexible, and general policy iteration scheme for robust MDPs, and proposes fast methods for computing the robust Bellman operator in quasi-linear time, nearly matching the linear complexity of the non-robust Bellman operator.

### Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

- Computer Science, Mathematics · Math. Oper. Res.
- 2016

Under the additional (restrictive) assumption that the state space is partitioned into two sets of states that are respectively transient and recurrent under all policies, it is shown that Howard's PI terminates after at most n(m − 1) · Õ(n(τₜ + τᵣ)) = Õ(n²m(τₜ + τᵣ)) iterations, which generalizes a recent result for deterministic MDPs.

### Randomised Procedures for Initialising and Switching Actions in Policy Iteration

- Mathematics · AAAI
- 2016

A routine to find a good initial policy for PI, and a randomised action-switching rule for PI that admits a bound of (2 + ln(k − 1))ⁿ on the expected number of iterations, which is the tightest complexity bound known for PI.

### Batch-Switching Policy Iteration

- Computer Science · IJCAI
- 2016

Batch-Switching Policy Iteration (BSPI), a family of deterministic PI algorithms that switches states in "batches", taking the batch size b as a parameter, is introduced, and a bound of O(1.6479ⁿ) on the number of iterations taken by an instance of BSPI is shown, believed to be the tightest bound yet for any variant of PI.

## References


### The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

- Mathematics · Math. Oper. Res.
- 2011

It is proved that the classic policy-iteration method and the original simplex method with the most-negative-reduced-cost pivoting rule of Dantzig are strongly polynomial-time algorithms for solving the Markov decision problem (MDP) with a fixed discount rate.

### A New Complexity Result on Solving the Markov Decision Problem

- Computer Science, Mathematics · Math. Oper. Res.
- 2005

This is the first strongly polynomial-time algorithm for solving the Markov decision problem when the discount factor is a constant less than 1 and the method is a combinatorial interior-point method related to the work of Ye.

### The Complexity of Markov Decision Processes

- Computer Science · Math. Oper. Res.
- 1987

All three variants of the classical problem of optimal policy computation in Markov decision processes, finite horizon, infinite horizon discounted, and infinite horizon average cost are shown to be complete for P, and therefore most likely cannot be solved by highly parallel algorithms.

### On policy iteration as a Newton's method and polynomial policy iteration algorithms

- Computer Science · AAAI/IAAI
- 2002

This paper improves the upper bounds to a polynomial for policy iteration on MDP problems with special graph structure based on the connection between policy iteration and Newton's method for finding the zero of a convex function.

### Strategy Iteration Is Strongly Polynomial for 2-Player Turn-Based Stochastic Games with a Constant Discount Factor

- Computer Science · JACM
- 2013

This work improves the bound given by Ye and shows that the same bound applies to the number of iterations performed by the strategy iteration algorithm, a generalization of Howard’s policy iteration algorithm used for solving 2-player turn-based stochastic games with discounted zero-sum rewards.

### Subexponential lower bounds for randomized pivoting rules for the simplex algorithm

- Computer Science, Mathematics · STOC '11
- 2011

Subexponential lower bounds are given for the Random-Edge and Random-Facet randomized pivoting rules; such lower bounds were previously known only in abstract settings, not for concrete linear programs.

### Lower Bounds for Howard's Algorithm for Finding Minimum Mean-Cost Cycles

- Computer Science · ISAAC
- 2010

This work provides the first weighted graphs on which Howard's algorithm performs Ω(n²) iterations, where n is the number of vertices in the graph.

### On the Complexity of Policy Iteration

- Computer Science, Mathematics · UAI
- 1999

This paper proves the first such non-trivial, worst-case, upper bounds on the number of iterations required by PI to converge to the optimal policy.

### A Subexponential Lower Bound for Zadeh's Pivoting Rule for Solving Linear Programs and Games

- Computer Science · IPCO
- 2011

The first subexponential lower bound, of the form 2^(Ω(√n)), is obtained by utilizing connections between pivoting steps performed by simplex-based algorithms and improving switches performed by policy iteration algorithms for 1-player and 2-player games.

### Markov Decision Processes: Discrete Stochastic Dynamic Programming

- Computer Science · Wiley Series in Probability and Statistics
- 1994

Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.