Corpus ID: 233004507

Distributional Offline Continuous-Time Reinforcement Learning with Neural Physics-Informed PDEs (SciPhy RL for DOCTR-L)

  • I. Halperin
  • Published 2021
  • Computer Science, Physics, Economics
  • ArXiv
This paper addresses distributional offline continuous-time reinforcement learning (DOCTR-L) with stochastic policies for high-dimensional optimal control. A soft distributional version of the classical Hamilton-Jacobi-Bellman (HJB) equation is given by a semilinear partial differential equation (PDE). This ‘soft HJB equation’ can be learned from offline data without assuming that the latter correspond to a previous optimal or near-optimal policy. A data-driven solution of the soft HJB equation…
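The abstract describes learning a value function by fitting a PDE residual on offline data. As a hedged illustration only (none of this code is from the paper, and the paper's soft HJB equation is semilinear and high-dimensional), the sketch below applies the same residual-minimization idea to a toy linear ODE, u'' = u with u(0) = u'(0) = 1, whose exact solution is exp(x). The value function is parametrized as a degree-4 polynomial, so minimizing the squared residual over sampled points reduces to a single linear least-squares solve:

```python
import numpy as np

# Toy "physics-informed" fit (illustrative, not the paper's algorithm):
# solve u''(x) - u(x) = 0 on [0, 1] with u(0) = 1, u'(0) = 1 by penalizing
# the residual at sampled points. Exact solution: u(x) = exp(x).
xs = np.linspace(0.0, 1.0, 50)          # "offline" sample points
deg = 4

# Feature rows for u, u', u'' of a polynomial sum_k c_k x^k.
P   = np.vander(xs, deg + 1, increasing=True)                             # u
dP  = np.hstack([np.zeros((len(xs), 1)),
                 P[:, :-1] * np.arange(1, deg + 1)])                      # u'
d2P = np.hstack([np.zeros((len(xs), 2)),
                 P[:, :-2] * (np.arange(2, deg + 1) * np.arange(1, deg))])  # u''

# Stack PDE-residual rows with heavily weighted boundary-condition rows.
A = np.vstack([d2P - P, P[:1] * 10.0, dP[:1] * 10.0])
b = np.concatenate([np.zeros(len(xs)), [10.0, 10.0]])
c, *_ = np.linalg.lstsq(A, b, rcond=None)

u_half = np.polyval(c[::-1], 0.5)   # should be close to exp(0.5) ≈ 1.6487
```

Because the toy operator is linear in the coefficients, no gradient descent is needed; a neural parametrization, as in the paper, would replace this linear solve with stochastic optimization of the same residual loss.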

Figures from this paper


Reinforcement Learning in Continuous Time and Space
  • K. Doya
  • Mathematics, Medicine
  • Neural Computation
  • 2000
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action, based on the Hamilton-Jacobi-Bellman (HJB) equation.
Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations
We introduce physics-informed neural networks – neural networks that are trained to solve supervised learning tasks while respecting any given laws of physics described by general nonlinear partial differential equations.
Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
This paper admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.
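The ABM prior itself is a learned neural behavior model; as a hypothetical sketch of just the weighting rule it relies on (the function and values below are invented for illustration), advantage weighting assigns each logged action a softmax weight in its advantage, so actions that worked well in the data dominate the prior:

```python
import numpy as np

# Illustrative advantage weighting (not the paper's full ABM training loop):
# behavior actions are reweighted by exp(advantage / temperature).
def advantage_weights(advantages, temperature=1.0):
    a = np.asarray(advantages, dtype=float) / temperature
    w = np.exp(a - a.max())           # numerically stabilized softmax
    return w / w.sum()

w = advantage_weights([-1.0, 0.0, 2.0])
print(w)  # most of the mass sits on the highest-advantage action
```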
Reinforcement Learning in Continuous Time and Space: A Stochastic Control Approach
A complete analysis of the problem in the linear–quadratic (LQ) setting is carried out and it is deduced that the optimal feedback control distribution for balancing exploitation and exploration is Gaussian, which interprets the widely adopted Gaussian exploration in RL, beyond its simplicity for sampling.
Deep Learning-Based Numerical Methods for High-Dimensional Parabolic Partial Differential Equations and Backward Stochastic Differential Equations
We study a new algorithm for solving parabolic partial differential equations (PDEs) and backward stochastic differential equations (BSDEs) in high dimension, which is based on an analogy between the BSDE and reinforcement learning, with the gradient of the solution playing the role of the policy function.
Hamilton-Jacobi-Bellman Equations for Maximum Entropy Optimal Control
The resulting algorithms are the first data-driven control methods that use an information-theoretic exploration mechanism in continuous time; they are shown to enhance the regularity of the viscosity solution and to be asymptotically consistent as the effect of entropy regularization diminishes.
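The entropy-regularized ("soft") maximum underlying such methods can be sketched concretely: with temperature tau, the soft value of action values q is tau * log-sum-exp(q / tau), which upper-bounds the hard maximum and converges to it as tau tends to 0 (a minimal numerical illustration, not the paper's continuous-time construction):

```python
import numpy as np

# Soft (entropy-regularized) maximum used in maximum-entropy control.
def soft_max(q, tau):
    q = np.asarray(q, dtype=float)
    m = q.max()                       # stabilize the log-sum-exp
    return m + tau * np.log(np.exp((q - m) / tau).sum())

q = [1.0, 2.0, 3.0]
print(soft_max(q, 1.0))   # ~3.41: strictly above max(q) at high temperature
print(soft_max(q, 0.01))  # ~3.00: recovers the hard max as tau -> 0
```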
Off-Policy Deep Reinforcement Learning without Exploration
This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
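As a hedged tabular caricature of the batch-constraint idea (the actual BCQ algorithm is a deep method built on a learned generative model; the toy Q-table and names below are invented for illustration), the greedy action can be restricted to state-action pairs that actually occur in the offline batch, avoiding value estimates for unseen actions:

```python
import numpy as np

# Toy batch-constrained greedy action selection.
Q = np.array([[0.0, 5.0, 1.0],      # state 0: unconstrained argmax is action 1
              [2.0, 0.0, 9.0]])     # state 1: unconstrained argmax is action 2
batch = {(0, 0), (0, 2), (1, 0)}    # (state, action) pairs present offline

def batch_constrained_greedy(Q, batch, s):
    # Only actions observed with state s in the batch are eligible.
    allowed = [a for a in range(Q.shape[1]) if (s, a) in batch]
    return max(allowed, key=lambda a: Q[s, a])

print(batch_constrained_greedy(Q, batch, 0))  # 2, not the unseen action 1
print(batch_constrained_greedy(Q, batch, 1))  # 0, the only action in the batch
```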
Risk-Sensitive Reinforcement Learning
A risk-sensitive Q-learning algorithm is derived for settings where transition probabilities are unknown, as is necessary for modeling human behavior. Applied to quantify human behavior in a sequential investment task, it provides a significantly better fit to the behavioral data and leads to an interpretation of the subjects' responses that is consistent with prospect theory.
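One standard route to risk sensitivity, sketched here as a generic illustration rather than the paper's specific Q-learning rule, values a random return by its certainty equivalent under exponential utility, CE(R) = -(1/beta) log E[exp(-beta R)], which falls below the mean for risky gambles:

```python
import numpy as np

# Certainty equivalent under exponential utility with risk aversion beta > 0.
def certainty_equivalent(returns, probs, beta):
    returns = np.asarray(returns, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return -np.log(np.sum(probs * np.exp(-beta * returns))) / beta

# A 50/50 gamble between 0 and 10 versus a sure 5: same mean, but the
# risk-averse value of the gamble is strictly lower.
print(certainty_equivalent([0.0, 10.0], [0.5, 0.5], beta=0.5))  # < 5
print(certainty_equivalent([5.0], [1.0], beta=0.5))             # exactly 5.0
```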
Solving high-dimensional partial differential equations using deep learning
A deep learning-based approach that can handle general high-dimensional parabolic PDEs: the PDE is reformulated using backward stochastic differential equations, and the gradient of the unknown solution is approximated by neural networks, very much in the spirit of deep reinforcement learning with the gradient acting as the policy function.
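The probabilistic representation these methods build on can be illustrated in a few lines: by the Feynman-Kac formula, the heat equation u_t + (1/2) u_xx = 0 with terminal condition u(T, x) = g(x) has solution u(t, x) = E[g(x + W_{T-t})] for Brownian motion W, which plain Monte Carlo can check for g(x) = x^2 (a minimal sketch; the deep BSDE machinery with neural networks for the gradient is deliberately absent):

```python
import numpy as np

# Monte Carlo check of the Feynman-Kac representation for the heat equation.
# For g(x) = x^2 the exact solution is u(t, x) = x^2 + (T - t).
rng = np.random.default_rng(0)

def u_mc(t, x, T=1.0, n_paths=200_000):
    w = rng.normal(0.0, np.sqrt(T - t), size=n_paths)  # samples of W_{T-t}
    return np.mean((x + w) ** 2)

print(u_mc(0.0, 1.0))  # ~2.0, matching x^2 + (T - t) = 1 + 1
```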
Risk-Averse Offline Reinforcement Learning
The Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm able to learn risk-averse policies in a fully offline setting, is presented, and it is demonstrated empirically that in the presence of natural distribution shifts, O-RAAC learns policies with good average performance.