On the Complexity of Adversarial Decision Making

@article{Foster2022OnTC,
  title={On the Complexity of Adversarial Decision Making},
  author={Dylan J. Foster and Alexander Rakhlin and Ayush Sekhari and Karthik Sridharan},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.13063}
}
A central problem in online learning and decision making—from bandits to reinforcement learning—is to understand what modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show—via new upper and lower bounds—that the Decision-Estimation Coefficient, a complexity measure… 
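For orientation, the (offset) Decision-Estimation Coefficient of a model class $\mathcal{M}$ relative to a reference model $\overline{M}$, as introduced in the stochastic setting by "The Statistical Complexity of Interactive Decision Making" (listed in the references below), is commonly written as

$$\mathsf{dec}_{\gamma}(\mathcal{M}, \overline{M}) \;=\; \inf_{p \in \Delta(\Pi)} \, \sup_{M \in \mathcal{M}} \, \mathbb{E}_{\pi \sim p}\Big[ f^{M}(\pi_{M}) - f^{M}(\pi) \;-\; \gamma \cdot D_{\mathrm{H}}^{2}\big(M(\pi), \overline{M}(\pi)\big) \Big],$$

where $\Pi$ is the decision space, $f^{M}(\pi)$ is the expected payoff of decision $\pi$ under model $M$, $\pi_{M}$ is the optimal decision for $M$, and $D_{\mathrm{H}}$ is the Hellinger distance. This notation is not spelled out in the truncated abstract above and is included only as a reminder of the quantity whose upper and lower bounds the paper studies.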

Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning

Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) is recently

References

SHOWING 1-10 OF 61 REFERENCES

Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

TLDR
This work develops the first algorithm with a "best-of-both-worlds" guarantee: it achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with $\tilde{\mathcal{O}}(\sqrt{T})$ regret even when the losses are adversarial, where $T$ is the number of episodes.

RL for Latent MDPs: Regret Guarantees and a Lower Bound

TLDR
This work considers the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs), shows that the key link is a notion of separation between the MDP system dynamics, and provides an efficient algorithm with a local guarantee.

Learning Markov Games with Adversarial Opponents: Efficient Algorithms and Fundamental Limits

TLDR
This work studies no-regret learning in Markov games with adversarial opponents when competing against the best policy in hindsight, and proves a statistical hardness result even in the most favorable scenario in which both of the above conditions hold.

The Statistical Complexity of Interactive Decision Making

TLDR
The main result of this work provides a complexity measure, the Decision-Estimation Coefficient, that is proven to be both necessary and sufficient for sample-efficient interactive learning and constitutes a theory of learnability for interactive decision making.

Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations

TLDR
This work considers the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies Π that may not contain any near-optimal policy, and provides an algorithm whose error is bounded in terms of the rank d of the underlying MDP.

Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches

TLDR
Focusing on the special case of factored MDPs, this work proves an exponential lower bound for a general class of model-free approaches, including OLIVE, which, when combined with the algorithmic results, demonstrates an exponential separation between model-based and model-free RL in some rich-observation settings.

Learning to Optimize via Posterior Sampling

TLDR
A Bayesian regret bound for posterior sampling is established that applies broadly and can be specialized to many model classes; the bound depends on a new notion the authors refer to as the eluder dimension, which measures the degree of dependence among action rewards.
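As a quick illustration of the posterior sampling idea summarized above (a minimal sketch, not the paper's general algorithm), the following Thompson-sampling loop for a Bernoulli bandit keeps an independent Beta posterior per arm; the arm means, horizon, and seed are made up for the example.

import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Minimal posterior (Thompson) sampling for a Bernoulli bandit.

    Each arm keeps a Beta(successes + 1, failures + 1) posterior; at every
    round a mean is sampled from each posterior and the argmax arm is pulled.
    Returns the cumulative pseudo-regret over the horizon.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # Sample a plausible mean reward for each arm from its posterior.
        sampled = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(sampled))
        reward = rng.binomial(1, true_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best_mean - true_means[arm]

    return regret

# Example: cumulative pseudo-regret on a 3-armed bandit.
print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=2000))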

Better Algorithms for Benign Bandits

TLDR
A new algorithm is proposed for the bandit linear optimization problem which obtains a regret bound of O(√Q), where Q is the total variation in the cost functions, showing that it is possible to incur much less regret in a slowly changing environment, even in the bandit setting.

An Information-Theoretic Approach to Minimax Regret in Partial Monitoring

We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under partial monitoring with no assumptions on the space of signals or decisions of the adversary. We then

Kernel-based methods for bandit convex optimization

We consider the adversarial convex bandit problem and we build the first poly(T)-time algorithm with poly(n) √T-regret for this problem. To do so we introduce three new ideas in the derivative-free
...