On the Complexity of Adversarial Decision Making

Dylan J. Foster, Alexander Rakhlin, Ayush Sekhari, Karthik Sridharan
A central problem in online learning and decision making—from bandits to reinforcement learning—is to understand what modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show—via new upper and lower bounds—that the Decision-Estimation Coefficient, a complexity measure… 

Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning

Two new DEC-type complexity measures are proposed, Explorative DEC (EDEC) and Reward-Free DEC (RFDEC), which are shown to be necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC, which only captures no-regret learning.



RL for Latent MDPs: Regret Guarantees and a Lower Bound

This work considers the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs) and shows that the key link is a notion of separation between the MDP system dynamics, providing an efficient algorithm with a local guarantee.

Learning Markov Games with Adversarial Opponents: Efficient Algorithms and Fundamental Limits

This work studies no-regret learning in Markov games with adversarial opponents when competing against the best policy in hindsight, and proves a statistical hardness result even in the most favorable scenario when both above conditions are true.

The Statistical Complexity of Interactive Decision Making

The main result of this work provides a complexity measure, the Decision-Estimation Coefficient, that is proven to be both necessary and sufficient for sample-efficient interactive learning and constitutes a theory of learnability for interactive decision making.

Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations

This work considers the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies Π that may not contain any near-optimal policy, and provides an algorithm whose error is bounded in terms of the rank d of the underlying MDP.

Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches

Focusing on the special case of factored MDPs, this work proves an exponential lower bound for a general class of model-free approaches, including OLIVE, which, when combined with the algorithmic results, demonstrates an exponential separation between model-based and model-free RL in some rich-observation settings.

Learning to Optimize via Posterior Sampling

A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes. The bound depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.

Better Algorithms for Benign Bandits

A new algorithm is proposed for the bandit linear optimization problem that obtains a regret bound of O(√Q), where Q is the total variation in the cost functions, showing that it is possible to incur much less regret in a slowly changing environment even in the bandit setting.

An Information-Theoretic Approach to Minimax Regret in Partial Monitoring

We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under partial monitoring with no assumptions on the space of signals or decisions of the adversary. We then

Kernel-based methods for bandit convex optimization

We consider the adversarial convex bandit problem and we build the first poly(T)-time algorithm with poly(n) √T-regret for this problem. To do so we introduce three new ideas in the derivative-free

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

This work proposes a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise, together with a new, computationally efficient algorithm with linear function approximation, named UCRL-VTR, for linear mixture MDPs in the episodic undiscounted setting.