Policy Optimization with Model-based Explorations

Feiyang Pan, Qingpeng Cai, Anxiang Zeng, Chun-Xiang Pan, Qing Da, Hua-Lin He, Qing He, Pingzhong Tang
Model-free reinforcement learning methods such as the Proximal Policy Optimization (PPO) algorithm have been successfully applied to complex decision-making problems such as Atari games. However, these methods suffer from high variance and high sample complexity. On the other hand, model-based reinforcement learning methods that learn the transition dynamics are more sample-efficient, but they often suffer from bias in the transition estimation. How to make use of both model-based and model…
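Since the abstract centers on PPO, the clipped surrogate objective at its core can be sketched briefly. This is a minimal NumPy illustration of the objective from the PPO paper, not the paper's own implementation; the function name and example values are illustrative.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from PPO.

    ratio:     pi_theta(a|s) / pi_theta_old(a|s) for sampled actions
    advantage: advantage estimates A_t
    eps:       clipping range (0.2 is the paper's default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum gives a pessimistic bound that
    # removes the incentive to push the ratio outside [1 - eps, 1 + eps].
    return np.minimum(unclipped, clipped).mean()

# With a positive advantage, a ratio of 1.5 is clipped to 1.2:
ratios = np.array([0.9, 1.0, 1.5])
advs = np.ones(3)
print(ppo_clip_objective(ratios, advs))  # mean of (0.9, 1.0, 1.2)
```

The clipping is what makes large, destabilizing policy updates unattractive, which is the variance issue the abstract refers to.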


Proximal policy optimization with model-based methods
PPOMM adds the information of the next state to the objective function of the proximal policy optimization (PPO) algorithm through a model-based method, which outperforms the state-of-the-art PPO algorithm when evaluated across 49 Atari games in the Arcade Learning Environment (ALE).
Zero Shot Learning on Simulated Robots
It is demonstrated that not only is training on the self-model far more data efficient than learning even a single task, but also that it allows for learning new tasks without necessitating any additional data collection, essentially allowing zero-shot learning of new tasks.
Learn Continuously, Act Discretely: Hybrid Action-Space Reinforcement Learning For Optimal Execution
A hybrid RL method that first uses a continuous-control agent to scope an action subset, then deploys a fine-grained agent to choose a specific limit price; it significantly outperforms previous learning-based methods for order execution.
Trust the Model When It Is Confident: Masked Model-based Actor-Critic
It is shown theoretically that if the use of model-generated data is restricted to state-action pairs where the model error is small, the performance gap between model and real rollouts can be reduced; based on this, the work proposes Masked Model-based Actor-Critic (M2AC), a novel policy optimization algorithm that maximizes a model-based lower bound of the true value function.
Field-aware Calibration: A Simple and Empirically Strong Method for Reliable Probabilistic Predictions
This paper introduces a new evaluation metric named field-level calibration error that measures the bias in predictions over a sensitive input field that the decision-maker is concerned with, and proposes Neural Calibration, a simple yet powerful post-hoc calibration method that learns to calibrate by making full use of the field-aware information over the validation set.
Warm Up Cold-start Advertisements: Improving CTR Predictions via Learning to Learn ID Embeddings
Experimental results showed that Meta-Embedding can significantly improve both the cold-start and warm-up performance of six existing CTR prediction models, ranging from lightweight models such as Factorization Machines to complicated deep models such as PNN and DeepFM.
GoChat: Goal-oriented Chatbots with Hierarchical Reinforcement Learning
This work proposes Goal-oriented Chatbots (GoChat), a framework for end-to-end training of a chatbot to maximize the long-term return from offline multi-turn dialogue datasets; it outperforms previous methods in both the quality of response generation and the success rate of accomplishing the goal.
Reading Like HER: Human Reading Inspired Extractive Summarization
This work re-examines the problem of extractive text summarization for long documents as a contextual-bandit problem and solves it with policy gradient, adopting a convolutional neural network to encode the gist of paragraphs for rough reading and a decision-making policy with an adapted termination mechanism for careful reading.


Model-Ensemble Trust-Region Policy Optimization
This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
Continuous Deep Q-Learning with Model-based Acceleration
This paper derives a continuous variant of the Q-learning algorithm, called normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods, and shows that it substantially improves performance on a set of simulated robotic control tasks.
Policy Gradient Methods for Reinforcement Learning with Function Approximation
This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning
A simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks, and is found that simple hash functions can achieve surprisingly good results on many challenging tasks.
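The hashed count-based bonus described above can be sketched in a few lines. This is an illustrative SimHash-style version under assumed defaults (class name, bit width, and bonus coefficient are all hypothetical, not from the paper's code): states are bucketed by the sign pattern of a random projection, and the exploration bonus decays as the bucket's visit count grows.

```python
import numpy as np

class HashedCountBonus:
    """Count-based exploration bonus over a SimHash discretization of states.

    Bonus for a state s is beta / sqrt(n(phi(s))), where phi hashes s
    into one of 2**n_bits buckets via a random linear projection.
    """
    def __init__(self, state_dim, n_bits=16, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((n_bits, state_dim))  # random projection
        self.beta = beta
        self.counts = {}

    def bonus(self, state):
        # SimHash: the sign pattern of the projected state is the bucket key,
        # so nearby states tend to share a bucket.
        key = tuple((self.A @ np.asarray(state) > 0).astype(np.int8))
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.beta / np.sqrt(self.counts[key])

b = HashedCountBonus(state_dim=4)
s = [0.1, -0.2, 0.3, 0.4]
print(b.bonus(s))  # first visit: full bonus beta
print(b.bonus(s))  # repeat visit: bonus shrinks by 1/sqrt(2)
```

Adding this bonus to the environment reward is the "simple generalization of the classic count-based approach" the summary refers to, extended to continuous states via hashing.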
Continuous control with deep reinforcement learning
This work presents an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces, and demonstrates that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
VIME: Variational Information Maximizing Exploration
VIME is introduced, an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics which efficiently handles continuous state and action spaces and can be applied with several different underlying RL algorithms.
From Pixels to Torques: Policy Learning with Deep Dynamical Models
This paper introduces a data-efficient, model-based reinforcement learning algorithm that learns a closed-loop control policy from pixel information only, and facilitates fully autonomous learning from pixels to torques.
Guided Policy Search via Approximate Mirror Descent
A new guided policy search algorithm is derived that is simpler and provides appealing improvement and convergence guarantees in simplified convex and linear settings, and it is shown that in the more general nonlinear setting, the error in the projection step can be bounded.