# Sparse Markov Decision Processes With Causal Sparse Tsallis Entropy Regularization for Reinforcement Learning

@article{Lee2017SparseMD, title={Sparse Markov Decision Processes With Causal Sparse Tsallis Entropy Regularization for Reinforcement Learning}, author={Kyungjae Lee and Sungjoon Choi and Songhwai Oh}, journal={IEEE Robotics and Automation Letters}, year={2017}, volume={3}, pages={1466-1473} }

In this letter, a sparse Markov decision process (MDP) with novel causal sparse Tsallis entropy regularization is proposed. The proposed policy regularization induces a sparse and multimodal optimal policy distribution of a sparse MDP. The full mathematical analysis of the proposed sparse MDP is provided. We first analyze the optimality condition of a sparse MDP. Then, we propose a sparse value iteration method that solves a sparse MDP and then prove the convergence and optimality of sparse…
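To illustrate what a "sparse and multimodal" policy distribution means, here is a minimal sketch; it is not the letter's implementation. It assumes the sparsemax projection of Martins and Astudillo (2016), which appears in the reference list of this letter, applied to a hypothetical vector of action values. The function names and the example values are illustrative assumptions only.

```python
import numpy as np


def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (sparsemax)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]            # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum    # which sorted entries stay non-zero
    k_max = k[support][-1]                 # size of the support set
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)


def softmax(z):
    """Standard softmax, shown only for comparison."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


# Hypothetical action values for a single state (not taken from the letter).
q = np.array([1.0, 0.8, 0.1])

sparse_policy = sparsemax(q)  # approximately [0.6, 0.4, 0.0]
soft_policy = softmax(q)      # every action keeps non-zero probability
```

In this toy example the two high-valued actions share the probability mass (multimodal) while the low-valued action receives exactly zero (sparse), whereas softmax assigns every action a positive probability.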

## 42 Citations

### A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning

- Computer Science
- NeurIPS
- 2019

A generic method for devising regularization forms is provided, off-policy actor-critic algorithms are proposed for complex environment settings, and a full mathematical analysis of the proposed regularized MDPs is conducted.

### Maximum Causal Tsallis Entropy Imitation Learning

- Computer Science
- NeurIPS
- 2018

This paper proves that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score, and proposes a maximum causal Tsallis entropy imitation learning algorithm with a sparse mixture density network (sparse MDN) by modeling mixture weights using a sparsemax distribution.

### Sparse Actor-Critic: Sparse Tsallis Entropy Regularized Reinforcement Learning in a Continuous Action Space

- Computer Science
- 2020 17th International Conference on Ubiquitous Robots (UR)
- 2020

This paper introduces a novel off-policy actor-critic reinforcement learning algorithm with a sparse Tsallis entropy regularizer that outperforms former on-policy and off-policy RL algorithms in terms of convergence speed and performance.

### Path Consistency Learning in Tsallis Entropy Regularized MDPs

- Computer Science
- ICML
- 2018

A class of novel path consistency learning (PCL) algorithms, called sparse PCL, is proposed for the sparse ERL problem; it can work with both on-policy and off-policy data, and an empirical comparison with its soft counterpart shows its advantage, especially in problems with a large number of actions.

### A Unified Framework for Regularized Reinforcement Learning

- Computer Science
- ArXiv
- 2019

A general framework for regularized Markov decision processes (MDPs) where the goal is to find an optimal policy that maximizes the expected discounted total reward plus a policy regularization term is proposed.

### Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning

- Computer Science
- ArXiv
- 2019

A new class of Markov decision processes (MDPs) with Tsallis entropy maximization is proposed, which generalizes existing maximum entropy reinforcement learning (RL), and it is found that a different value of the entropic index is desirable for different types of RL problems.

### Robust Entropy-regularized Markov Decision Processes

- Computer Science
- ArXiv
- 2021

It is shown how the robust ER-MDP model framework and results can be integrated into different algorithmic schemes including value or (modified) policy iteration, which would lead to new robust RL and inverse RL algorithms to handle uncertainties.

### Twice regularized MDPs and the equivalence between robustness and regularization

- Computer Science
- NeurIPS
- 2021

It is established that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs, and the corresponding Bellman operators enable developing policy iteration schemes with convergence and robustness guarantees.

### Adaptive Tsallis Entropy Regularization for Efficient Reinforcement Learning

- Computer Science
- 2022 13th International Conference on Information and Communication Technology Convergence (ICTC)
- 2022

This paper presents adaptive regularization using Tsallis entropy (ART) for efficient exploration in reinforcement learning (RL) problems and proposes a condition for the optimal entropic index, which has the smallest regret bound among all positive entropic indices.

### Finding Near Optimal Policies via Reducive Regularization in Markov Decision Processes

- Computer Science
- 2021

It is proved that the iteration complexity to obtain an ε-optimal policy could be maintained or even reduced in comparison with setting a sufficiently small λ in both dynamic programming and policy gradient methods.

## References

Showing 1-10 of 26 references.

### Infinite Time Horizon Maximum Causal Entropy Inverse Reinforcement Learning

- Computer Science
- IEEE Transactions on Automatic Control
- 2018

The maximum causal entropy framework is extended to the infinite time horizon setting and a gradient-based algorithm for the maximum discounted causal entropy formulation is developed that enjoys the desired feature of being model agnostic, a property that is absent in many previous IRL algorithms.

### Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy

- Computer Science
- 2010

The principle of maximum causal entropy is introduced, a general technique for applying information theory to decision-theoretic, game-theoretic, and control settings where relevant information is sequentially revealed over time.

### From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

- Computer Science
- ICML
- 2016

Sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities, is proposed, and an unexpected connection between this new loss and the Huber classification loss is revealed.

### Equivalence Between Policy Gradients and Soft Q-Learning

- Computer Science
- ArXiv
- 2017

There is a precise equivalence between Q-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, and it is shown that "soft" Q-learning is exactly equivalent to a policy gradient method.

### Deep Reinforcement Learning with Double Q-Learning

- Computer Science
- AAAI
- 2016

This paper proposes a specific adaptation to the DQN algorithm and shows that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.

### Reinforcement Learning with Deep Energy-Based Policies

- Computer Science
- ICML
- 2017

A method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before, is proposed and a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution is applied.

### Actor-Critic Reinforcement Learning with Energy-Based Policies

- Computer Science
- EWRL
- 2012

This work introduces the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture, that is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high-dimensional domains.

### Value-Difference Based Exploration: Adaptive Control between Epsilon-Greedy and Softmax

- Computer Science
- KI
- 2011

The results show that a VDBE-Softmax policy can outperform ε-greedy, Softmax and VDBE policies in combination with on- and off-policy learning algorithms such as Q-learning and Sarsa.

### Control of a Quadrotor With Reinforcement Learning

- Computer Science
- IEEE Robotics and Automation Letters
- 2017

A method to control a quadrotor with a neural network trained using reinforcement learning techniques is presented, along with a new learning algorithm that differs from the existing ones in certain aspects and is found to be more applicable to controlling a quadrotor than existing algorithms.

### Continuous control with deep reinforcement learning

- Computer Science
- ICLR
- 2016

This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.