Corpus ID: 222133336

Policy Learning Using Weak Supervision

Jingkang Wang, Hongyi Guo, Zhaowei Zhu, Yang Liu
Most existing policy learning solutions require the learning agents to receive high-quality supervision signals, e.g., rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is often infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages weak supervision to perform policy learning efficiently. To handle this problem, we treat the "weak supervisions" as…



Efficient Reductions for Imitation Learning
This work proposes two alternative algorithms for imitation learning where training occurs over several episodes of interaction and shows that this leads to stronger performance guarantees and improved performance on two challenging problems: training a learner to play a 3D racing game and Mario Bros.
Deep Q-learning From Demonstrations
This paper presents an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data, and is able to automatically assess the necessary ratio of demonstration data while learning, thanks to a prioritized replay mechanism.
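The distinctive ingredient of DQfD is a large-margin supervised loss that keeps the Q-value of the demonstrated action above all alternatives during pre-training on demonstrations. A minimal sketch of that loss term (the margin value 0.8 and the tabular Q-values here are illustrative, not taken from the paper's hyperparameters):

```python
import numpy as np

def dqfd_margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin classification loss from DQfD:
    J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    where l(a_E, a) = margin for a != a_E and 0 otherwise.
    The loss is 0 only when the demonstrated action a_E beats every
    other action's Q-value by at least `margin`."""
    penalties = np.full_like(q_values, margin)
    penalties[expert_action] = 0.0  # no penalty on the expert's own action
    return np.max(q_values + penalties) - q_values[expert_action]

q = np.array([0.1, 2.0, 0.3])
print(dqfd_margin_loss(q, expert_action=1))  # 0.0: action 1 already dominates by > margin
```

In DQfD this term is combined with the usual TD loss and an L2 regulariser, so the demonstration data shapes the Q-function without preventing it from later improving on the demonstrator.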
Reinforcement Learning from Demonstration through Shaping
This paper investigates the intersection of reinforcement learning and expert demonstrations, leveraging the theoretical guarantees provided by reinforcement learning, and using expert demonstrations to speed up this learning by biasing exploration through a process called reward shaping.
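The shaping referred to here builds on potential-based reward shaping (Ng et al., 1999), which adds F(s, s') = γΦ(s') − Φ(s) to the environment reward without changing the optimal policy; the demonstrations are used to define the potential Φ. A minimal sketch, with a hypothetical distance-to-goal potential standing in for a demonstration-derived one:

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping: adding
    F(s, s') = gamma * potential(s_next) - potential(s)
    to the environment reward r provably preserves the optimal policy,
    while steering exploration toward high-potential states. In the
    demonstration setting, `potential` would be estimated so that
    states near expert trajectories score highly."""
    return r + gamma * potential(s_next) - potential(s)

# Illustrative potential: negative distance to a goal state at x = 10.
phi = lambda s: -abs(10 - s)
print(shaped_reward(0.0, s=3, s_next=4, potential=phi, gamma=1.0))  # prints 1.0
```

A step toward the goal earns a positive shaping bonus even when the sparse environment reward is zero, which is exactly how the demonstrations bias exploration.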
On the sample complexity of reinforcement learning.
Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time. Expand
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
This paper proposes a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no-regret algorithm in an online learning setting, and demonstrates that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
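The iterative algorithm described here is DAgger (Dataset Aggregation): roll out the current policy, have the expert relabel the states it actually visits, and retrain on the aggregate of all data collected so far. A minimal sketch; the tabular toy policy, `expert`, `fit`, and `rollout` below are illustrative stand-ins, not the paper's implementation:

```python
def dagger(rollout_states, expert, fit, n_iters=5):
    """Sketch of DAgger: at each iteration, collect the states visited
    by the current policy, query the expert for their labels, add the
    pairs to an aggregated dataset, and refit a stationary
    deterministic policy on everything collected so far."""
    data = []                                      # aggregated (state, expert_action) pairs
    policy = fit(data)                             # initial (untrained) policy
    for _ in range(n_iters):
        visited = rollout_states(policy)           # states the current policy reaches
        data += [(s, expert(s)) for s in visited]  # expert relabels them
        policy = fit(data)                         # train on the aggregate, not the last batch
    return policy

# Toy instantiation: a tabular policy that memorises expert labels
# and defaults to action 0 on unseen states.
expert = lambda s: s % 2
fit = lambda data: (lambda s, table=dict(data): table.get(s, 0))
rollout = lambda pi: list(range(10))
pi = dagger(rollout, expert, fit)
```

Training on states the learner itself visits, rather than only on expert trajectories, is what removes the compounding-error problem of plain behavioral cloning and yields the no-regret guarantee.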
Hybrid Reinforcement Learning with Expert State Sequences
This paper considers learning from the state sequences of an expert whose actions are unobserved: it proposes a novel tensor-based model to infer the expert's unobserved actions, together with a hybrid objective combining reinforcement learning and imitation learning.
Co-training for Policy Learning
This work presents a meta-algorithm for co-training for sequential decision making, and demonstrates the effectiveness of the approach across a wide range of tasks, including discrete/continuous control and combinatorial optimization.
Weakly-Supervised Reinforcement Learning for Controllable Behavior
This work introduces a framework for using weak supervision to automatically disentangle the semantically meaningful subspace of tasks from the enormous space of nonsensical "chaff" tasks, and shows that this learned subspace enables efficient exploration and provides a representation that captures distance between states.
Reinforcement Learning with Perturbed Rewards
This work develops a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed, and shows that trained policies based on the estimated surrogate reward can achieve higher expected rewards, and converge faster than existing baselines.
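For a binary reward flipped with known (or estimated) rates, an unbiased surrogate reward can be obtained by inverting the 2x2 confusion matrix of the noise. A minimal sketch in that spirit; the function name and the noise-model notation here are assumptions, not the paper's exact interface:

```python
def surrogate_reward(observed, r_plus, r_minus, e_plus, e_minus):
    """Unbiased surrogate for a binary reward {r_minus, r_plus} observed
    through symmetric-flip noise: e_plus = P(r_plus seen as r_minus),
    e_minus = P(r_minus seen as r_plus). Solving the two linear
    equations E[r_hat | true r] = r gives the estimator below, so a
    policy trained on r_hat optimises the same expected return as one
    trained on the clean reward."""
    d = 1.0 - e_plus - e_minus  # determinant of the 2x2 confusion matrix
    if observed == r_plus:
        return ((1.0 - e_minus) * r_plus - e_plus * r_minus) / d
    return ((1.0 - e_plus) * r_minus - e_minus * r_plus) / d
```

The surrogate can overshoot the clean reward range (that is the price of unbiasedness), but its expectation under the noise model matches the true reward exactly.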
Reinforcement Learning with a Corrupted Reward Channel
This work formalises this problem as a generalised Markov Decision Problem called Corrupt Reward MDP, and finds that by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
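One concrete form of the randomisation discussed in this line of work is quantilisation: rather than taking the argmax action, whose estimated value may be inflated by a corrupted reward channel, sample uniformly among the top fraction of actions by estimated value. A minimal sketch under that interpretation (the function name and the quantile parameter are illustrative):

```python
import random

def quantilised_action(q_values, top_q=0.25, rng=random):
    """Quantilised action selection: sample uniformly from the top
    `top_q` fraction of actions ranked by estimated value. Blunting
    the optimisation this way limits how much probability mass can
    concentrate on actions whose apparent value comes from a
    corrupted reward signal."""
    k = max(1, int(len(q_values) * top_q))                 # size of the top quantile
    top = sorted(range(len(q_values)),
                 key=lambda i: -q_values[i])[:k]           # indices of the k best actions
    return rng.choice(top)
```

With `top_q` small enough that the quantile contains a single action, this degenerates to greedy selection; larger values trade expected value for robustness to corruption.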