• Corpus ID: 232147811

# Reinforcement Learning, Bit by Bit

@article{Lu2021ReinforcementLB,
title={Reinforcement Learning, Bit by Bit},
author={Xiuyuan Lu and Benjamin Van Roy and Vikranth Reddy Dwaracherla and Morteza Ibrahimi and Ian Osband and Zheng Wen},
journal={ArXiv},
year={2021},
volume={abs/2103.04047}
}
• Published 6 March 2021
• Computer Science
• ArXiv
Reinforcement learning agents have demonstrated remarkable achievements in simulated environments. Data efficiency poses an impediment to carrying this success over to real environments. The design of data-efficient agents calls for a deeper understanding of information acquisition and representation. We discuss concepts and regret analysis that together offer principled guidance. This line of thinking sheds light on questions of what information to seek, how to seek that information, and what…

## Citations

• Computer Science
NeurIPS
• 2021
This work couples optimal information acquisition with the optimal design of learning targets, draws new insights into learning targets from the literature on rate-distortion theory, and presents empirical results confirming the value of information when deciding what to learn.
• Computer Science
ArXiv
• 2022
For the first time, policy gradient and actor-critic algorithms for OIR optimization, based upon a new entropy gradient theorem, are developed, and both asymptotic and non-asymptotic convergence results with global optimality guarantees are established.
• Computer Science
• 2022
It is shown that the quality of joint predictions drives performance in downstream decision tasks, and the importance of this observation to the community is highlighted.
• Computer Science, Economics
ICML
• 2022
This work provides empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful as a signal for exploration, for acting safely under distribution shifts, and for robustifying value-based planning with a learned model.
• Computer Science
• 2021
Empirical results show that Perceptual Schemata enables a state representation that can maintain multiple objects observed in sequence with independent dynamics while an LSTM cannot, and can generalize more gracefully to larger environments with more distractor objects, while an LSTM quickly overfits to the training tasks.
• Computer Science
• 2021
This work establishes several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions.
• Computer Science
ArXiv
• 2022
The results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.
• Computer Science
ICML
• 2022
The advantage of contextual IDS over conditional IDS is provably demonstrated, emphasizing the importance of considering the context distribution; the main message is that an intelligent agent should invest more in actions that are beneficial for future unseen contexts, whereas conditional IDS can be myopic.
• Computer Science
ArXiv
• 2022
This paper aims to elucidate this latter perspective by presenting a brief survey of information-theoretic models of capacity-limited decision making in biological and artificial agents.
• Computer Science
ArXiv
• 2022
An algorithm is introduced that iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model, and an information-theoretic, Bayesian regret bound is proved for this algorithm that holds for any finite-horizon, episodic sequential decision-making problem.

## References

SHOWING 1-10 OF 109 REFERENCES

• Computer Science
• 2017
This report extends Information-Directed Sampling to solving general reinforcement learning problems, with the hope of more reliable regret guarantees and proposes practical algorithms for solving these problems.
• Computer Science
ICLR
• 2019
This work builds on recent advances in distributional reinforcement learning and proposes a novel, tractable approximation of IDS for deep Q-learning and explicitly accounts for both parametric uncertainty and heteroscedastic observation noise.
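The information ratio at the heart of IDS can be illustrated with a small sketch. This is a hypothetical three-armed Bernoulli bandit with a sample-based information-gain proxy, not the paper's deep Q-learning approximation; all arm parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed Bernoulli bandit; the posterior over each arm's
# mean reward is represented by Monte Carlo samples from Beta posteriors.
samples = rng.beta(a=[2.0, 5.0, 1.0], b=[2.0, 2.0, 4.0], size=(10_000, 3))

means = samples.mean(axis=0)                       # posterior mean per arm
best = samples.argmax(axis=1)                      # optimal arm per sample
p_star = np.bincount(best, minlength=3) / len(best)

# Expected regret of playing each arm, estimated from the samples.
delta = samples.max(axis=1).mean() - means

# Information-gain proxy: variance of each arm's posterior mean across
# the hypotheses "arm k is optimal" (a standard sample-based surrogate).
gain = np.zeros(3)
for k in range(3):
    mask = best == k
    if mask.any():
        gain += p_star[k] * (samples[mask].mean(axis=0) - means) ** 2

# IDS picks the arm minimizing squared expected regret per unit of
# information gained about the optimal arm.
ids_arm = int(np.argmin(delta**2 / np.maximum(gain, 1e-12)))
```

An arm with modest immediate regret but high information gain can win this ratio, which is how IDS trades off exploration against exploitation.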
• Computer Science
NeurIPS
• 2019
By making a connection between information-theoretic quantities and confidence bounds, this work obtains results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff.
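The flavor of such results is captured by the generic information-ratio inequality, stated here in its bandit form from the broader information-ratio literature rather than as this specific paper's bound:

$$\mathbb{E}\big[\mathrm{Regret}(T)\big] \;\le\; \sqrt{\overline{\Gamma}\, T\, \mathbb{H}(A^*)},$$

where $\overline{\Gamma}$ bounds the per-period ratio of squared expected regret to information gained about the optimal action $A^*$, and $\mathbb{H}(A^*)$ is the entropy of $A^*$ under the prior.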
• Computer Science
ICML
• 2017
A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.
• Computer Science
ICLR
• 2020
This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short: a collection of carefully designed experiments that investigate core capabilities of reinforcement learning agents.
• Computer Science
NIPS
• 2013
An $\tilde{O}(\tau S \sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
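The posterior-sampling loop behind PSRL can be sketched in a few lines. This is a hypothetical tiny tabular MDP; the Dirichlet transition posterior and Gaussian-style reward perturbation are simplifying assumptions standing in for the full posterior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action episodic MDP with horizon 5. PSRL keeps
# a posterior over the model; Dirichlet counts over transitions and
# noisy empirical reward means stand in for it here.
S, A, H = 2, 2, 5
trans_counts = np.ones((S, A, S))     # Dirichlet(1) prior per (s, a)
reward_sum = np.zeros((S, A))
visit = np.ones((S, A))

def sample_mdp():
    """Draw one MDP from the (approximate) posterior."""
    P = np.stack([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                  for s in range(S)])
    R = reward_sum / visit + rng.normal(0.0, 1.0 / np.sqrt(visit))
    return P, R

def greedy_policy(P, R):
    """Finite-horizon value iteration on the sampled MDP."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V          # (S, A) action values at step h
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

# One PSRL episode: sample a model, act greedily with respect to it,
# then (not shown) update the counts with the observed transitions.
P, R = sample_mdp()
pi = greedy_policy(P, R)
```

Because the sampled model is optimal with the same probability the true model would assign, acting greedily on it explores without any explicit optimism bonus.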
• Computer Science
NeurIPS
• 2018
It is shown that this approach is efficient with linear representations, provides simple illustrations of its efficacy with nonlinear representations and scales to large-scale problems far better than previous attempts.
It is shown that an agent imbued with an epistemic-risk-seeking utility function is able to explore efficiently, as measured by regret, and the resulting algorithm K-learning is competitive with other state-of-the-art algorithms in practice.
• Computer Science
NeurIPS
• 2018
Q-learning with UCB exploration achieves $\sqrt{T}$ regret in an episodic MDP setting; this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator."
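The count-based bonus update can be sketched in tabular form. The environment sizes, bonus constant `c`, and clipping at `H` below are illustrative assumptions following the general recipe, not the paper's exact constants.

```python
import numpy as np

S, A, H = 4, 2, 6          # hypothetical episodic tabular MDP sizes
c = 1.0                    # bonus scale (an assumption, not the paper's constant)
Q = np.full((H, S, A), float(H))   # optimistic initialization at H
N = np.zeros((H, S, A))            # visit counts per (step, state, action)

def ucb_q_update(h, s, a, r, s_next):
    """One Q-learning step with a count-based UCB exploration bonus."""
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha = (H + 1) / (H + t)                    # step size used in this style of analysis
    bonus = c * np.sqrt(H**3 * np.log(max(t, 2.0)) / t)
    v_next = Q[h + 1, s_next].max() if h + 1 < H else 0.0
    target = r + v_next + bonus
    # Clip at H so estimates stay within the attainable value range.
    Q[h, s, a] = min((1 - alpha) * Q[h, s, a] + alpha * target, H)

ucb_q_update(0, 0, 1, 1.0, 1)
```

The bonus shrinks as a state-action pair is visited more often, so optimism fades exactly where the agent has accumulated evidence.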
• Computer Science
NIPS
• 2017
A new method is proposed, called Hybrid Reward Architecture (HRA), which takes as input a decomposed reward function and learns a separate value function for each component reward function, enabling more effective learning.
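The HRA idea of one value function per reward component can be sketched with tables standing in for the paper's networks; the sizes, discount, and learning rate here are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the Hybrid Reward Architecture idea: the reward is
# decomposed into components, each with its own value function, and the
# agent selects actions using their sum.
S, A, K = 3, 2, 2            # states, actions, reward components
gamma, alpha = 0.9, 0.5
Q = np.zeros((K, S, A))      # one Q-table per component reward

def hra_update(s, a, rewards, s_next):
    """One tabular Q-learning step per component; rewards has length K."""
    a_next = Q.sum(axis=0)[s_next].argmax()   # greedy w.r.t. aggregate Q
    for k in range(K):
        td = rewards[k] + gamma * Q[k, s_next, a_next] - Q[k, s, a]
        Q[k, s, a] += alpha * td

# Example step: component rewards (1.0, 0.5) observed for action 1 in state 0.
hra_update(0, 1, [1.0, 0.5], 2)
agg = Q.sum(axis=0)          # aggregate value used for action selection
```

Each component function faces a simpler learning problem than the full reward, which is the source of the claimed learning speedup.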