• Corpus ID: 232147811

Reinforcement Learning, Bit by Bit

Xiuyuan Lu, Benjamin Van Roy, Vikranth Reddy Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen
Reinforcement learning agents have demonstrated remarkable achievements in simulated environments. Data efficiency poses an impediment to carrying this success over to real environments. The design of data-efficient agents calls for a deeper understanding of information acquisition and representation. We discuss concepts and regret analysis that together offer principled guidance. This line of thinking sheds light on questions of what information to seek, how to seek that information, and what…

The Value of Information When Deciding What to Learn

This work addresses this shortcoming directly to couple optimal information acquisition with the optimal design of learning targets and offers new insights into learning targets from the literature on rate-distortion theory before turning to empirical results that confirm the value of information when deciding what to learn.

Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search

Policy gradient and actor-critic algorithms for OIR optimization, based on a new entropy gradient theorem, are developed for the first time, and both asymptotic and nonasymptotic convergence results with global optimality guarantees are established.

The Neural Testbed: Evaluating Predictive Distributions

It is shown that the quality of joint predictions drives performance in downstream decision tasks, and the importance of this observation to the community is highlighted.

Model-Value Inconsistency as a Signal for Epistemic Uncertainty

This work provides empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful as a signal for exploration, for acting safely under distribution shifts, and for robustifying value-based planning with a learned model.

Learning to Represent State with Perceptual Schemata

Empirical results show that Perceptual Schemata enables a state representation that can maintain multiple objects observed in sequence with independent dynamics while an LSTM cannot, and can generalize more gracefully to larger environments with more distractor objects, while an LSTM quickly overfits to the training tasks.

From Predictions to Decisions: The Importance of Joint Predictive Distributions

This work establishes several results pertaining to combinatorial decision problems, sequential predictions, and multi-armed bandits to elucidate the essential role of joint predictive distributions.

Gaussian Imagination in Bandit Learning

The results formalize the folklore that so-called Bayesian agents remain effective when instantiated with diffuse misspecified distributions.

Contextual Information-Directed Sampling

The advantage of contextual IDS over conditional IDS is provably demonstrated, emphasizing the importance of considering the context distribution. The main message is that an intelligent agent should invest more in actions that are beneficial for future unseen contexts, whereas conditional IDS can be myopic.

On Rate-Distortion Theory in Capacity-Limited Cognition & Reinforcement Learning

This paper aims to elucidate this latter perspective by presenting a brief survey of information-theoretic models of capacity-limited decision making in biological and artificial agents.

Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

An algorithm is introduced that iteratively computes an approximately value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model, and an information-theoretic, Bayesian regret bound is proved for this algorithm that holds for any finite-horizon, episodic sequential decision-making problem.



Information-Directed Sampling for Reinforcement Learning

This report extends Information-Directed Sampling to solving general reinforcement learning problems, with the hope of more reliable regret guarantees and proposes practical algorithms for solving these problems.

Information-Directed Exploration for Deep Reinforcement Learning

This work builds on recent advances in distributional reinforcement learning and proposes a novel, tractable approximation of IDS for deep Q-learning and explicitly accounts for both parametric uncertainty and heteroscedastic observation noise.

Information-Theoretic Confidence Bounds for Reinforcement Learning

By making a connection between information-theoretic quantities and confidence bounds, this work obtains results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff.

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

A Bayesian expected regret bound for PSRL in finite-horizon episodic Markov decision processes is established, which improves upon the best previous bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm.

Behaviour Suite for Reinforcement Learning

This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement

(More) Efficient Reinforcement Learning via Posterior Sampling

An $\tilde{O}(\tau S \sqrt{AT})$ bound on expected regret is established, one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

Randomized Prior Functions for Deep Reinforcement Learning

It is shown that this approach is efficient with linear representations, provides simple illustrations of its efficacy with nonlinear representations and scales to large-scale problems far better than previous attempts.

Variational Bayesian Reinforcement Learning with Regret Bounds

It is shown that an agent imbued with an epistemic-risk-seeking utility function is able to explore efficiently, as measured by regret, and the resulting algorithm K-learning is competitive with other state-of-the-art algorithms in practice.

Is Q-learning Provably Efficient?

Q-learning with UCB exploration is shown to achieve $\sqrt{T}$ regret in an episodic MDP setting, the first analysis in the model-free setting to do so without requiring access to a "simulator."

Hybrid Reward Architecture for Reinforcement Learning

A new method is proposed, called Hybrid Reward Architecture (HRA), which takes as input a decomposed reward function and learns a separate value function for each component reward function, enabling more effective learning.