Corpus ID: 239769234

SCORE: Spurious COrrelation REduction for Offline Reinforcement Learning

@article{Deng2021SCORESC,
  title={SCORE: Spurious COrrelation REduction for Offline Reinforcement Learning},
  author={Zhihong Deng and Zuyue Fu and Lingxiao Wang and Zhuoran Yang and Chenjia Bai and Zhaoran Wang and Jing Jiang},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.12468}
}
Offline reinforcement learning (RL) aims to learn the optimal policy from a pre-collected dataset without online interactions. Most of the existing studies focus on distributional shift caused by out-of-distribution actions. However, even in-distribution actions can raise serious problems. Since the dataset only contains limited information about the underlying model, offline RL is vulnerable to spurious correlations, i.e., the agent tends to prefer actions that by chance lead to high returns… 
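
For reference, the setting the abstract describes can be written compactly. The formulation below uses standard offline RL notation and is not specific to SCORE:

```latex
% Offline RL: learn a policy from a fixed dataset D collected by an unknown
% behavior policy \mu, with no further environment interaction.
\max_{\pi}\; J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad \text{given only } \mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N} \sim \mu .
```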

Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination

This paper proposes to augment the offline dataset using trained bidirectional dynamics models and rollout policies with a double-check mechanism, introducing conservatism by trusting only the samples on which the forward and backward models agree.
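
A minimal sketch of the agreement test this summary alludes to, assuming the "double check" is an L2-distance threshold between each model's prediction and the proposed transition; the function names, threshold, and distance measure are illustrative, not the paper's exact rule:

```python
import numpy as np

def double_check(forward_model, backward_model, s, a, s_next, tol=0.1):
    """Keep an imagined transition (s, a, s_next) only if the forward and
    backward dynamics models roughly agree on it.
    forward_model(s, a) predicts s_next; backward_model(s_next, a) predicts s."""
    s_next_hat = forward_model(s, a)       # forward prediction of the next state
    s_hat = backward_model(s_next, a)      # backward prediction of the current state
    return (np.linalg.norm(s_next_hat - s_next) < tol and
            np.linalg.norm(s_hat - s) < tol)
```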

References

Showing 1-10 of 40 references

Conservative Q-Learning for Offline Reinforcement Learning

Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
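
The conservative term CQL adds on top of the usual Bellman error is compact enough to sketch. The snippet below shows the discrete-action CQL(H) regularizer (push Q down on all actions via a logsumexp, push it up on dataset actions), with the trade-off weight alpha left as a free hyperparameter; this is a sketch of the published objective, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_values, actions, td_target, alpha=1.0):
    """q_values:  (batch, num_actions) current Q(s, .)
    actions:   (batch,) long tensor of dataset actions
    td_target: (batch,) bootstrapped Bellman target for Q(s, a)"""
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Conservative regularizer: lower Q on all actions, raise it on dataset actions.
    conservative = (torch.logsumexp(q_values, dim=1) - q_taken).mean()
    bellman = 0.5 * F.mse_loss(q_taken, td_target)
    return alpha * conservative + bellman
```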

MOPO: Model-based Offline Policy Optimization

A new model-based offline RL algorithm is proposed that applies the variance of a Lipschitz-regularized model as a penalty to the reward function, and it is found that this algorithm outperforms both standard model-based RL methods and state-of-the-art model-free offline RL approaches on existing offline RL benchmarks, as well as on two challenging continuous control tasks.
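
The reward penalty itself is a one-liner. Below is a sketch under the assumption that uncertainty is measured by ensemble disagreement (the maximum per-dimension standard deviation across an ensemble of learned dynamics models), which is one of the heuristics discussed in the model-based offline RL literature rather than MOPO's exact estimator:

```python
import numpy as np

def penalized_reward(reward_pred, next_state_preds, lam=1.0):
    """next_state_preds: (ensemble_size, state_dim) predictions for one (s, a).
    Subtract an uncertainty estimate from the model's predicted reward, so the
    policy is discouraged from exploiting regions the model cannot support."""
    uncertainty = np.std(next_state_preds, axis=0).max()   # ensemble disagreement
    return reward_pred - lam * uncertainty
```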

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
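
The quantity DualDICE estimates has a standard definition, and once it is available, off-policy evaluation reduces to a reweighted average over the dataset:

```latex
% Discounted stationary distribution correction and its use for evaluation:
w_{\pi/\mathcal{D}}(s,a) \;=\; \frac{d^{\pi}(s,a)}{d^{\mathcal{D}}(s,a)},
\qquad
\rho(\pi) \;=\; \mathbb{E}_{(s,a,r)\sim\mathcal{D}}\!\left[ w_{\pi/\mathcal{D}}(s,a)\, r \right],
% where d^{\pi} is the discounted stationary distribution of \pi, d^{\mathcal{D}} is the
% dataset distribution, and DualDICE estimates w without access to the behavior policy.
```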

Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and accordingly down-weights their contribution in the training objectives, is proposed; UWAC is observed to substantially improve model stability during training.
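
A minimal sketch of the weighting idea, assuming the per-sample weight is inversely proportional to the estimated variance of the bootstrapped target (e.g. from MC-dropout); the exact form of the weight and the normalization are assumptions, not UWAC's published code:

```python
import torch

def uncertainty_weighted_td_loss(q_pred, td_target, target_variance, beta=1.0):
    """Down-weight each sample's TD error by the predictive variance of its target."""
    weights = (beta / (target_variance + 1e-6)).detach()
    weights = weights / weights.mean()     # keep the overall loss scale stable
    return (weights * (q_pred - td_target) ** 2).mean()
```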

Is Pessimism Provably Efficient for Offline RL?

A pessimistic variant of the value iteration algorithm (PEVI) is proposed, which incorporates an uncertainty quantifier as a penalty function; a data-dependent upper bound on the suboptimality of PEVI is established for general Markov decision processes (MDPs).
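
The update described in that sentence can be written in one line:

```latex
% Pessimistic value iteration: subtract the uncertainty quantifier \Gamma from the backup.
\widehat{Q}(s,a) \;\leftarrow\; \widehat{r}(s,a) + \gamma\, \widehat{\mathbb{E}}_{s' \mid s,a}\!\left[\widehat{V}(s')\right] - \Gamma(s,a),
\qquad
\widehat{V}(s) = \max_{a} \widehat{Q}(s,a).
```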

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
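
BEAR is usually described with a sampled maximum mean discrepancy (MMD) constraint that keeps the learned policy inside the support of the off-policy data; a sketch of that penalty follows, where the Gaussian kernel and bandwidth are illustrative choices rather than necessarily the paper's:

```python
import torch

def mmd_penalty(policy_actions, data_actions, sigma=10.0):
    """Sampled MMD^2 between actions drawn from the learned policy and actions
    from the dataset at the same states. Both inputs: (n, action_dim)."""
    def kernel(x, y):
        sq_dist = ((x.unsqueeze(1) - y.unsqueeze(0)) ** 2).sum(-1)  # pairwise distances
        return torch.exp(-sq_dist / (2 * sigma))
    return (kernel(policy_actions, policy_actions).mean()
            - 2 * kernel(policy_actions, data_actions).mean()
            + kernel(data_actions, data_actions).mean())
```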

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL

A novel backup operator, Expected-Max Q-Learning (EMaQ), is presented; it naturally restricts learned policies to remain within the support of the offline dataset without any explicit regularization, while retaining desirable theoretical properties such as contraction.
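
A sketch of the backup the name refers to, assuming access to a generative model of the behavior policy that can propose candidate actions (the sampler here is a hypothetical helper):

```python
def emaq_target(reward, next_state, q_fn, behavior_sampler, n=10, gamma=0.99):
    """Expected-Max Q-Learning backup: the bootstrapped target maxes Q over N
    actions sampled from a model of the behavior policy, so the backup never
    queries actions outside the dataset's support.
    behavior_sampler(next_state, n) -> list of n candidate actions."""
    candidates = behavior_sampler(next_state, n)
    return reward + gamma * max(q_fn(next_state, a) for a in candidates)
```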

Offline Reinforcement Learning with Fisher Divergence Critic Regularization

This work parameterizes the critic as the log of the behavior policy that generated the offline data, plus a state-action value offset term that can be learned with a neural network; the resulting algorithm, Fisher-BRC (Behavior Regularized Critic), achieves both improved performance and faster convergence over existing state-of-the-art methods.
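
The parameterization named in the summary is easy to spell out. A sketch, assuming a pre-trained behavior model whose log-density is available as a callable; the network sizes are arbitrary, and the Fisher-divergence gradient penalty on the offset is omitted:

```python
import torch
import torch.nn as nn

class OffsetCritic(nn.Module):
    """Q(s, a) = log mu(a | s) + O(s, a): a fixed behavior-policy log-density
    plus a learned state-action offset network."""
    def __init__(self, behavior_log_prob, state_dim, action_dim, hidden=256):
        super().__init__()
        self.behavior_log_prob = behavior_log_prob   # callable: (s, a) -> log mu(a|s)
        self.offset = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        offset = self.offset(torch.cat([s, a], dim=-1)).squeeze(-1)
        return self.behavior_log_prob(s, a) + offset
```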

MOReL : Model-Based Offline Reinforcement Learning

Theoretically, MOReL is shown to be minimax optimal (up to log factors) for offline RL; empirically, it matches or exceeds state-of-the-art results on widely studied offline RL benchmarks.
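
MOReL is commonly described as planning in a pessimistic MDP that routes "unknown" state-action pairs, those where an ensemble of learned dynamics models disagrees, to a low-reward absorbing state; the detector below is a sketch of that idea, with the disagreement measure and threshold as assumptions:

```python
import numpy as np

def is_unknown(next_state_preds, threshold=0.5):
    """Flag (s, a) as unknown when ensemble predictions of the next state
    disagree by more than a threshold (maximum pairwise L2 distance).
    next_state_preds: (ensemble_size, state_dim) predictions for one (s, a)."""
    disagreement = max(
        np.linalg.norm(p - q)
        for i, p in enumerate(next_state_preds)
        for q in next_state_preds[i + 1:])
    return disagreement > threshold
```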

A Minimalist Approach to Offline Reinforcement Learning

It is shown that the performance of state-of-the-art RL algorithms can be matched by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data; the resulting algorithm is a baseline that is simple to implement and tune.
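
The policy update this summary refers to fits in a few lines. The sketch below follows the commonly cited TD3+BC objective, maximizing lambda * Q(s, pi(s)) minus a squared behavior-cloning term, with lambda normalized by the average Q magnitude; it is restated from memory, not the authors' code:

```python
import torch

def td3_bc_policy_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """q_values: (batch,) Q(s, pi(s)); *_actions: (batch, action_dim).
    Returns a loss to minimize: -lambda * Q + behavior-cloning MSE."""
    lam = alpha / q_values.abs().mean().detach()        # scale-invariant trade-off
    bc = ((policy_actions - dataset_actions) ** 2).mean()
    return -lam * q_values.mean() + bc
```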