Corpus ID: 235694361

Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble

@article{Lee2021OfflinetoOnlineRL,
  title={Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble},
  author={Seunghyun Lee and Younggyo Seo and Kimin Lee and P. Abbeel and Jinwoo Shin},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.00591}
}
Recent advance in deep offline reinforcement learning (RL) has made it possible to train strong robotic agents from offline datasets. However, depending on the quality of the trained agents and the application being considered, it is often desirable to fine-tune such agents via further online interactions. In this paper, we observe that state-action distribution shift may lead to severe bootstrap error during fine-tuning, which destroys the good initial policy obtained via offline RL. To address…
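The two components named in the title lend themselves to a short sketch. The following is a minimal illustration of the ideas only, not the authors' released implementation: a pessimistic target aggregated over an ensemble of Q-functions, and a replay scheme that samples offline transitions according to assumed density-ratio weights (the paper trains a separate network to estimate these; here `w_offline` is simply taken as given, and the network sizes are illustrative).

    import torch
    import torch.nn as nn

    N, obs_dim, act_dim, gamma = 5, 17, 6, 0.99  # illustrative sizes

    def make_q():
        # one member of the pessimistic Q-ensemble
        return nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                             nn.Linear(256, 1))

    q_targets = [make_q() for _ in range(N)]

    def pessimistic_target(reward, next_obs, next_act, done):
        # Ensemble minimum as a simple pessimistic aggregate; a mean-minus-std
        # lower confidence bound is another common choice.
        sa = torch.cat([next_obs, next_act], dim=-1)
        q_next = torch.stack([q(sa) for q in q_targets])            # (N, B, 1)
        return reward + gamma * (1.0 - done) * q_next.min(dim=0).values

    def balanced_sample(offline_batch, online_batch, w_offline):
        # w_offline: assumed per-transition density-ratio estimates that favor
        # offline samples resembling the current online distribution.
        probs = torch.softmax(w_offline, dim=0)
        n = online_batch["obs"].shape[0]
        idx = torch.multinomial(probs, num_samples=n, replacement=True)
        return {k: torch.cat([offline_batch[k][idx], online_batch[k]]) for k in online_batch}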

Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

TLDR
A randomized ensemble of Q-functions is used to adaptively weight the behavior-cloning loss during online fine-tuning based on the agent's performance and training stability, yielding state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
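A hedged sketch of the adaptive weighting idea follows; the stability signal and the update rule below are illustrative placeholders, not the paper's exact schedule.

    import torch

    def actor_loss(q_values, pi_actions, data_actions, bc_weight):
        # q_values: (n_ensemble, batch) randomized-ensemble estimates for pi_actions
        q_term = -q_values.mean()
        bc_term = ((pi_actions - data_actions) ** 2).mean()
        return q_term + bc_weight * bc_term

    def update_bc_weight(bc_weight, recent_return, best_return, decay=0.95, grow=1.05):
        # Loosen the behavior-cloning constraint while online returns keep improving;
        # tighten it again when performance regresses (training looks unstable).
        return bc_weight * (decay if recent_return >= best_return else grow)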

Mildly Conservative Q-Learning for Offline Reinforcement Learning

TLDR
This paper proposes Mildly Conservative Q-learning (MCQ), in which OOD actions are actively trained by assigning them proper pseudo Q-values, and theoretically shows that MCQ induces a policy that behaves at least as well as the behavior policy, with no erroneous overestimation for OOD actions.
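An illustrative sketch of the pseudo-target idea (not the official MCQ code); `behavior_pi.sample` is an assumed interface for a learned behavior model.

    import torch

    def mcq_pseudo_target(q_net, obs, behavior_pi, num_samples=10):
        # OOD actions are not left to extrapolate freely: they are trained toward a
        # pseudo value built from Q-values of in-support actions sampled from the
        # behavior model. Returns one pseudo value per state, shape (B,).
        with torch.no_grad():
            acts = behavior_pi.sample(obs, num_samples)                 # assumed: (S, B, act_dim)
            obs_rep = obs.unsqueeze(0).expand(num_samples, *obs.shape)
            q = q_net(torch.cat([obs_rep, acts], dim=-1)).squeeze(-1)   # (S, B)
        return q.max(dim=0).values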

Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters

TLDR
This work investigates whether efficient approximations to deep ensembles can be similarly effective and demonstrates that, while some very efficient variants also outperform current state-of-the-art methods, they do not match the performance and robustness of MSG with deep ensembles.
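A sketch of the independence point, in the spirit of MSG rather than its exact objective: each ensemble member bootstraps from its own target network, and pessimism enters only as a lower confidence bound over the ensemble when evaluating actions.

    import torch

    def independent_td_losses(q_nets, q_targets, obs, act, reward,
                              next_obs, next_act, done, gamma=0.99):
        # One TD loss per member, each against its OWN target network (no shared
        # min-target), which keeps the members' errors more independent.
        sa, next_sa = torch.cat([obs, act], -1), torch.cat([next_obs, next_act], -1)
        losses = []
        for q, q_tgt in zip(q_nets, q_targets):
            with torch.no_grad():
                target = reward + gamma * (1.0 - done) * q_tgt(next_sa)
            losses.append(((q(sa) - target) ** 2).mean())
        return losses

    def ensemble_lcb(q_nets, obs, act, beta=1.0):
        qs = torch.stack([q(torch.cat([obs, act], -1)) for q in q_nets])
        return qs.mean(dim=0) - beta * qs.std(dim=0)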

Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning

TLDR
This paper proposes Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints that conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty.
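A sketch of an uncertainty-penalized target in the spirit of PBRL, not its exact formulation: the disagreement of bootstrapped Q-functions is subtracted from the Bellman target.

    import torch

    def penalized_target(q_target_nets, reward, next_obs, next_act, done,
                         beta=1.0, gamma=0.99):
        sa = torch.cat([next_obs, next_act], dim=-1)
        qs = torch.stack([q(sa) for q in q_target_nets])     # (N, B, 1)
        lcb = qs.mean(dim=0) - beta * qs.std(dim=0)          # penalize disagreement
        return reward + gamma * (1.0 - done) * lcb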

Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

TLDR
This work empirically observes that conservative offline RL algorithms do not work well in the multi-agent setting and proposes a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which combines first-order policy gradients with zeroth-order optimization to better optimize the conservative value functions over the actor parameters.

How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation

TLDR
This work develops two RL algorithms that can speed up training by using not only the action distributions of teacher policies, but also data collected by such policies on the task at hand, and investigates ways to minimize online interactions in a target task by reusing a suboptimal policy.

Online Decision Transformer

TLDR
Online Decision Transformer (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online fine-tuning in a unified framework, is proposed and shown to be competitive with the state of the art in absolute performance on the D4RL benchmark.

Beyond Tabula Rasa: Reincarnating Reinforcement Learning

TLDR
This work argues for an alternate approach to RL research, which could significantly improve real-world RL adoption and help democratize it further, and focuses on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent.

C^2: Co-design of Robots via Concurrent Networks Coupling Online and Offline Reinforcement Learning

TLDR
It is shown that Co-adaptation ignores exploration error during training and state-action distribution shift during parameter transfer, both of which hurt performance; a framework of concurrent networks that couples online and offline RL methods is therefore proposed and shown to be an effective way of discovering the optimal combination of morphology and controller.

Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space

TLDR
Planning to Practice (PTP) is proposed, a method that makes it practical to train goal-conditioned policies for long-horizon tasks that require multiple distinct types of interactions to solve.

References

SHOWING 1-10 OF 48 REFERENCES

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL

TLDR
A novel backup operator, Expected-Max Q-Learning (EMaQ), is presented, which naturally restricts learned policies to remain within the support of the offline dataset without any explicit regularization, while retaining desirable theoretical properties such as contraction.
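A sketch of the Expected-Max backup; `behavior_model.sample` is an assumed interface for the generative behavior model, and shapes are illustrative.

    import torch

    def emaq_target(q_target, behavior_model, reward, next_obs, done, n=10, gamma=0.99):
        # Maximize Q only over n actions sampled from the behavior model, so the
        # implied policy never leaves the support of the offline data.
        # reward, done: (B,)
        with torch.no_grad():
            acts = behavior_model.sample(next_obs, n)                      # (n, B, act_dim)
            obs_rep = next_obs.unsqueeze(0).expand(n, *next_obs.shape)
            q = q_target(torch.cat([obs_rep, acts], dim=-1)).squeeze(-1)   # (n, B)
            return reward + gamma * (1.0 - done) * q.max(dim=0).values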

Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning

TLDR
This paper admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.

Conservative Q-Learning for Offline Reinforcement Learning

TLDR
Conservative Q-learning (CQL) is proposed, which aims to address the limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
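A minimal sketch of the conservative regularizer added on top of the usual TD loss; the importance-weighting details of the full objective are omitted here.

    import torch

    def cql_penalty(q_net, obs, data_actions, sampled_actions, alpha=1.0):
        # Push Q down on actions drawn from the policy / a uniform proposal and
        # up on dataset actions, so the learned Q lower-bounds the true value.
        K = sampled_actions.shape[0]                          # sampled_actions: (K, B, act_dim)
        obs_rep = obs.unsqueeze(0).expand(K, *obs.shape)
        q_sampled = q_net(torch.cat([obs_rep, sampled_actions], -1)).squeeze(-1)  # (K, B)
        q_data = q_net(torch.cat([obs, data_actions], -1)).squeeze(-1)            # (B,)
        return alpha * (torch.logsumexp(q_sampled, dim=0) - q_data).mean()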

SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning

TLDR
SUNRISE is a simple unified ensemble method that is compatible with various off-policy RL algorithms and significantly improves the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks in both low-dimensional and high-dimensional environments.
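A sketch of an ensemble-weighted Bellman backup in the spirit of SUNRISE; the exact weighting function and temperature used in the paper may differ.

    import torch

    def weighted_td_loss(q_net, q_target_nets, obs, act, reward, next_obs, next_act,
                         done, temperature=10.0, gamma=0.99):
        with torch.no_grad():
            next_sa = torch.cat([next_obs, next_act], -1)
            q_next = torch.stack([q(next_sa) for q in q_target_nets])   # (N, B, 1)
            target = reward + gamma * (1.0 - done) * q_next.mean(dim=0)
            # Down-weight transitions whose targets the ensemble disagrees on.
            weight = torch.sigmoid(-q_next.std(dim=0) * temperature) + 0.5
        td = (q_net(torch.cat([obs, act], -1)) - target) ** 2
        return (weight * td).mean()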

A Minimalist Approach to Offline Reinforcement Learning

TLDR
It is shown that the performance of state-of-the-art RL algorithms can be matched by simply adding a behavior-cloning term to the policy update of an online RL algorithm and normalizing the data, and that the resulting algorithm is a baseline that is simple to implement and tune.
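The policy objective this TLDR describes can be sketched in a few lines; the Q-scale normalization below follows the trade-off trick the paper reports, but treat the snippet as an illustration rather than the reference implementation.

    import torch

    def td3_bc_actor_loss(q_net, pi, obs, data_actions, alpha=2.5):
        pi_actions = pi(obs)
        q = q_net(torch.cat([obs, pi_actions], -1))
        lam = alpha / q.abs().mean().detach()   # make the trade-off scale-invariant
        # Standard deterministic actor loss plus a behavior-cloning (MSE) term.
        return -(lam * q).mean() + ((pi_actions - data_actions) ** 2).mean()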

MOReL: Model-Based Offline Reinforcement Learning

TLDR
Theoretically, it is shown that MOReL is minimax optimal (up to log factors) for offline RL, and through experiments, it matches or exceeds state-of-the-art results in widely studied offline RL benchmarks.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

TLDR
This work introduces benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL, and releases benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase.
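Typical usage looks roughly like the following, assuming d4rl and a compatible gym version are installed; task names generally follow an "<env>-<dataset type>-v2" pattern.

    import gym
    import d4rl  # registers the offline tasks with gym

    env = gym.make("halfcheetah-medium-v2")
    dataset = d4rl.qlearning_dataset(env)    # observations, actions, rewards, terminals, ...
    print(dataset["observations"].shape, dataset["actions"].shape)

    # Raw returns are reported as normalized scores (0 = random, 100 = expert).
    print(env.get_normalized_score(4000.0) * 100)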

Accelerating Online Reinforcement Learning with Offline Datasets

TLDR
A novel algorithm is proposed that combines sample-efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of reinforcement learning policies.
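A sketch of an advantage-weighted policy update of the kind this paper builds on; the advantage estimate and the `pi.log_prob` interface here are assumptions of the sketch.

    import torch

    def advantage_weighted_actor_loss(pi, q_net, v_estimate, obs, data_actions, lam=1.0):
        # Clone dataset actions, but weight each by exp(advantage / lam), so the
        # policy stays close to the data while preferring its better actions.
        with torch.no_grad():
            adv = q_net(torch.cat([obs, data_actions], -1)).squeeze(-1) - v_estimate
            weights = torch.exp(adv / lam).clamp(max=100.0)   # clipped for stability
        log_prob = pi.log_prob(obs, data_actions)             # assumed interface
        return -(weights * log_prob).mean()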

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

TLDR
A novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), is proposed that can effectively optimize a policy offline using 10-20 times less data than prior works, achieving impressive deployment efficiency while maintaining the same or better sample efficiency.

MOPO: Model-based Offline Policy Optimization

TLDR
A new model-based offline RL algorithm is proposed that applies the variance of a Lipschitz-regularized model as a penalty to the reward function, and it is found that this algorithm outperforms both standard model-based RL methods and existing state-of-the-art model-free offline RL approaches on existing offline RL benchmarks, as well as two challenging continuous control tasks.
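In sketch form, the penalty this TLDR refers to amounts to subtracting a model-uncertainty estimate from the model-predicted reward before running standard RL inside the learned model; the ensemble-disagreement heuristic below is a simple stand-in for the paper's learned-variance penalty.

    import numpy as np

    def penalized_reward(ensemble_next_means, pred_reward, lam=1.0):
        # ensemble_next_means: (n_models, batch, obs_dim) next-state predictions.
        disagreement = np.linalg.norm(np.std(ensemble_next_means, axis=0), axis=-1)  # (batch,)
        return pred_reward - lam * disagreement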