# Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning

@article{Lee2019TsallisRL, title={Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning}, author={Kyungjae Lee and Sungyub Kim and Sungbin Lim and Sungjoon Choi and Songhwai Oh}, journal={ArXiv}, year={2019}, volume={abs/1902.00137} }

In this paper, we present a new class of Markov decision processes (MDPs), called Tsallis MDPs, with Tsallis entropy maximization, which generalizes existing maximum entropy reinforcement learning (RL). A Tsallis MDP provides a unified framework for the original RL problem and RL with various types of entropy, including the well-known standard Shannon-Gibbs (SG) entropy, using an additional real-valued parameter, called an entropic index. By controlling the entropic index, we can generate…
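To make the role of the entropic index concrete, here is a minimal Python sketch of Tsallis entropy for a discrete action distribution. It assumes the standard q-logarithm form of Tsallis entropy, S_q(π) = Σ_a −π(a) ln_q π(a), with ln_q(x) = (x^(q−1) − 1)/(q − 1) for q ≠ 1 and ln_q(x) = ln(x) for q = 1; the abstract above is truncated, so this definition is an assumption rather than a quotation. The function names `q_log` and `tsallis_entropy` are hypothetical.

```python
import math


def q_log(x, q):
    """q-logarithm: ordinary log at q == 1, (x^(q-1) - 1) / (q - 1) otherwise.

    Assumed standard definition; not quoted from the abstract above.
    """
    if q == 1.0:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def tsallis_entropy(probs, q):
    """Tsallis entropy of a discrete distribution for entropic index q > 0.

    Zero-probability actions are skipped because they contribute nothing.
    """
    return sum(-p * q_log(p, q) for p in probs if p > 0.0)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    for q in (1.0, 2.0, 50.0):
        print(q, tsallis_entropy(uniform, q))
```

Under this definition, q = 1 gives Shannon-Gibbs entropy, q = 2 gives 1 − Σ_a π(a)², and for a fixed distribution the entropy shrinks toward zero as q grows, so the entropy bonus fades and the objective approaches the unregularized one.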

#### 18 Citations

Maximum Entropy RL (Provably) Solves Some Robust RL Problems

- Computer Science
- ArXiv
- 2021

This paper proves theoretically that standard maximum entropy RL is robust to some disturbances in the dynamics and the reward function, and provides the first rigorous proof and theoretical characterization of the MaxEnt RL robust set.

Hamilton-Jacobi-Bellman Equations for Maximum Entropy Optimal Control

- Mathematics, Computer Science
- ArXiv
- 2020

The resulting algorithms are the first data-driven control methods that use an information-theoretic exploration mechanism in continuous time and are shown to enhance the regularity of the viscosity solution and to be asymptotically consistent as the effect of entropy regularization diminishes.

A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning

- Computer Science
- NeurIPS
- 2019

A generic method to devise regularization forms and propose off-policy actor-critic algorithms in complex environment settings is provided, and a full mathematical analysis of the proposed regularized MDPs is conducted.

Entropic Regularization of Markov Decision Processes

- Computer Science, Mathematics
- Entropy
- 2019

A broader family of f-divergences is considered, and more concretely α-divergences, which inherit the beneficial property of providing the policy improvement step in closed form while at the same time yielding a corresponding dual objective for policy evaluation.

Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

- Computer Science, Mathematics
- ArXiv
- 2021

A generalized policy mirror descent (GPMD) algorithm for solving regularized RL that converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution, even when the regularizer lacks strong convexity and smoothness.

Promoting Stochasticity for Expressive Policies via a Simple and Efficient Regularization Method

- Computer Science
- NeurIPS
- 2020

This work proposes a novel regularization method that is compatible with a broad range of expressive policy architectures and shows that it outperforms state-of-the-art regularized RL methods in continuous control tasks.

Weighted Entropy Modification for Soft Actor-Critic

- Computer Science
- ArXiv
- 2020

An algorithm motivated by self-balancing exploration with the introduced weight function is proposed, which achieves state-of-the-art performance on MuJoCo tasks despite its simple implementation.

Variational Inference MPC using Tsallis Divergence

- Computer Science
- Robotics: Science and Systems
- 2021

A novel Tsallis Variational Inference-Model Predictive Control algorithm is derived that allows for effective control of the cost/reward transform and is characterized by superior performance in terms of mean and variance reduction of the associated cost.

The Gradient Convergence Bound of Federated Multi-Agent Reinforcement Learning with Efficient Communication

- Computer Science
- ArXiv
- 2021

A utility function to consider the balance between reducing communication overheads and improving convergence performance is proposed and two new optimization methods on top of variation-aware periodic averaging methods are developed.

A Functional Mirror Descent Perspective on Reinforcement Learning

- 2020

Functional mirror descent offers a unifying perspective on optimization of statistical models and provides numerous advantages for the design and analysis of learning algorithms. It brings the…

#### References

SHOWING 1-10 OF 31 REFERENCES

Path Consistency Learning in Tsallis Entropy Regularized MDPs

- Computer Science, Mathematics
- ICML
- 2018

A class of novel path consistency learning (PCL) algorithms, called sparse PCL, is proposed for the sparse ERL problem. The algorithms can work with both on-policy and off-policy data, and an empirical comparison with the soft counterpart shows their advantage, especially in problems with a large number of actions.

Effective Exploration for Deep Reinforcement Learning via Bootstrapped Q-Ensembles under Tsallis Entropy Regularization

- Computer Science, Mathematics
- ArXiv
- 2018

A new DRL algorithm is developed that seamlessly integrates entropy-induced and bootstrap-induced techniques for efficient and deep exploration of the learning environment and is efficient in exploring actions with clear exploration value.

Modeling purposeful adaptive behavior with the principle of maximum causal entropy

- Computer Science
- 2010

The principle of maximum causal entropy is introduced, a general technique for applying information theory to decision-theoretic, game-theoretic, and control settings where relevant information is sequentially revealed over time.

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

- Computer Science, Mathematics
- ICML
- 2018

This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.

Sparse Markov Decision Processes With Causal Sparse Tsallis Entropy Regularization for Reinforcement Learning

- Computer Science, Mathematics
- IEEE Robotics and Automation Letters
- 2018

A sparse Markov decision process (MDP) with novel causal sparse Tsallis entropy regularization is proposed, which outperforms existing methods in terms of convergence speed and performance. A sparse value iteration method that solves a sparse MDP is also proposed, and its convergence and optimality are proved using the Banach fixed-point theorem.

Continuous Deep Q-Learning with Model-based Acceleration

- Computer Science
- ICML
- 2016

This paper derives a continuous variant of the Q-learning algorithm, called normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and substantially improves performance on a set of simulated robotic control tasks.

Bridging the Gap Between Value and Policy Based Reinforcement Learning

- Computer Science, Mathematics
- NIPS
- 2017

A new RL algorithm, Path Consistency Learning (PCL), is developed that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces and significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.

Boltzmann Exploration Done Right

- Computer Science, Mathematics
- NIPS
- 2017

This paper presents a simple non-monotone schedule that guarantees near-optimal performance, albeit only when given prior access to key problem parameters that are typically not available in practical situations (like the time horizon $T$ and the suboptimality gap $\Delta$).

Reinforcement Learning with Deep Energy-Based Policies

- Computer Science
- ICML
- 2017

A method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before, is proposed, and a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution is applied.

Equivalence Between Policy Gradients and Soft Q-Learning

- Computer Science, Mathematics
- ArXiv
- 2017

There is a precise equivalence between Q-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, and it is shown that "soft" Q-learning is exactly equivalent to a policy gradient method.