Multi-User Reinforcement Learning with Low Rank Rewards

@article{Agarwal2022MultiUserRL,
  title={Multi-User Reinforcement Learning with Low Rank Rewards},
  author={Naman Agarwal and Prateek Jain and S. Kowshik and Dheeraj M. Nagaraj and Praneeth Netrapalli},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.05355}
}
In this work, we consider the problem of collaborative multi-user reinforcement learning. In this setting, there are multiple users with the same state-action space and transition probabilities but with different rewards. Under the assumption that the reward matrix of the N users has a low-rank structure – a standard and practically successful assumption in the offline collaborative filtering setting – the question is whether we can design algorithms with significantly lower sample complexity compared…
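
To make the low-rank assumption concrete, here is a minimal synthetic sketch (not the paper's algorithm; the dimensions, rank r, observation fraction, and the truncated-SVD completion step are all illustrative assumptions): the N users share states, actions, and transitions, their per-(state, action) rewards are stacked into an N × (S·A) matrix of rank r, and partially observed entries are completed in the standard collaborative-filtering way.

```python
import numpy as np

# Illustrative setup (not the paper's algorithm): N users share states/actions
# and transitions, but each user has their own reward vector over (state, action)
# pairs. Stacking them gives an N x (S*A) reward matrix assumed to be low rank.
rng = np.random.default_rng(0)
N, S, A, r = 50, 20, 5, 3          # users, states, actions, assumed rank

U = rng.normal(size=(N, r))        # per-user latent factors
V = rng.normal(size=(S * A, r))    # per-(state, action) latent factors
R = U @ V.T                        # low-rank reward matrix, shape (N, S*A)

# Observe only a fraction of entries (each user sees few rewards), then complete
# the matrix with a rank-r truncated SVD of the rescaled observations -- the
# standard collaborative-filtering heuristic.
mask = rng.random(R.shape) < 0.3
R_obs = np.where(mask, R, 0.0) / 0.3   # inverse-propensity rescaling

u, s, vt = np.linalg.svd(R_obs, full_matrices=False)
R_hat = u[:, :r] @ np.diag(s[:r]) @ vt[:r, :]

rel_err = np.linalg.norm(R_hat - R) / np.linalg.norm(R)
print(f"relative completion error: {rel_err:.3f}")
```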

References


Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

A class of MDPs that exhibit low-rank structure, where the latent features are unknown, is considered, and it is shown that if one can only use the low-rank structure of the MDP to estimate part of the Q-function, one must incur a sample complexity exponential in the horizon H to learn a near-optimal policy.

Sample Complexity of Multi-task Reinforcement Learning

This paper introduces a new multi-task algorithm for a sequence of reinforcement-learning tasks when each task is sampled independently from (an unknown) distribution over a finite set of Markov decision processes whose parameters are initially unknown.

When Collaborative Filtering Meets Reinforcement Learning

This paper models the recommender-user interactive recommendation problem as an agent-environment RL task, which is mathematically described by a Markov decision process (MDP), and proposes a novel CF-based MDP to achieve collaborative recommendations for the entire user community.

Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity…
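
As an illustration of the generative-model setting referenced above, the sketch below (synthetic MDP, illustrative sample size m, and plain model-based value iteration rather than the paper's estimator or its minimax analysis) queries a simulator for next-state samples at every (state, action) pair, builds an empirical transition model, and plans on it.

```python
import numpy as np

# Illustrative sketch of the generative-model setting (not the paper's estimator):
# for every (state, action) we query the simulator for i.i.d. next states,
# build an empirical transition model, and run value iteration on it.
rng = np.random.default_rng(1)
S, A, gamma, m = 10, 3, 0.9, 200              # states, actions, discount, samples per (s, a)

P = rng.dirichlet(np.ones(S), size=(S, A))    # true (unknown) transition kernel
R = rng.random((S, A))                        # known reward, for simplicity

# Query the generative model m times per (s, a) and count next-state frequencies.
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        samples = rng.choice(S, size=m, p=P[s, a])
        P_hat[s, a] = np.bincount(samples, minlength=S) / m

# Value iteration on the empirical model.
Q = np.zeros((S, A))
for _ in range(500):
    V = Q.max(axis=1)
    Q = R + gamma * (P_hat @ V)

print("greedy policy:", Q.argmax(axis=1))
```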

Multi-task Deep Reinforcement Learning with PopArt

This work proposes to automatically adapt the contribution of each task to the agent’s updates, so that all tasks have a similar impact on the learning dynamics, and learns a single trained policy that exceeds median human performance on this multi-task domain.

Near-optimal Representation Learning for Linear Bandits and Linear RL

A sample-efficient algorithm, MTLR-OFUL, is proposed, which leverages the representation shared by the M linear bandits to achieve a regret bound that significantly improves upon the baseline Õ(Md√T) achieved by solving each task independently.

Sharing Knowledge in Multi-Task Deep Reinforcement Learning

This work studies the benefit of sharing representations among tasks to enable the effective use of deep neural networks in Multi-Task Reinforcement Learning, and extends the well-known finite-time bounds of Approximate Value-Iteration to the multi-task setting.

Reward-Free Exploration for Reinforcement Learning

An efficient algorithm is given that conducts episodes of exploration and then returns near-optimal policies for an arbitrary number of reward functions, and a nearly matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound is given, demonstrating the near-optimality of the algorithm in this setting.
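
The two-phase reward-free protocol can be sketched as follows; this uses uniform-random exploration and a plug-in empirical model purely for illustration, not the paper's exploration algorithm or its sample-complexity guarantees.

```python
import numpy as np

# Schematic sketch of the reward-free protocol (uniform exploration here, not the
# paper's exploration scheme): an exploration phase gathers transitions with no
# reward signal; afterwards, any reward function can be plugged into the learned
# empirical model to plan a policy.
rng = np.random.default_rng(2)
S, A, H, episodes = 8, 3, 10, 2000

P = rng.dirichlet(np.ones(S), size=(S, A))      # true transitions (unknown)
counts = np.zeros((S, A, S))

# --- exploration phase: no rewards observed ---
for _ in range(episodes):
    s = 0
    for _ in range(H):
        a = rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        counts[s, a, s_next] += 1
        s = s_next

visits = counts.sum(axis=2, keepdims=True)
P_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / S)

# --- planning phase: works for an arbitrary reward function supplied afterwards ---
def plan(reward):                      # reward: array of shape (S, A)
    V = np.zeros(S)
    for _ in range(H):                 # H sweeps = backward induction for a stationary MDP
        Q = reward + P_hat @ V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

print(plan(rng.random((S, A))))
```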

Markov Decision Processes with Continuous Side Information

This work considers a reinforcement learning (RL) setting in which the agent interacts with a sequence of episodic MDPs, each accompanied by observed side information (a context), and proposes algorithms for learning in such Contextual Markov Decision Processes (CMDPs) under the assumption that the unobserved MDP parameters vary smoothly with the observed context.

Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation

A simple, iterative learning algorithm is presented that finds the optimal Q-function with sample complexity $\widetilde{O}(\frac{1}{\epsilon^{\max(d_1, d_2)+2}})$ when the optimal Q-function has low rank and the discount factor $\gamma$ is below a certain threshold, providing an exponential improvement in sample complexity.
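
A heuristic sketch of the high-level idea (not the paper's estimator): treat the Q-function as an |S| × |A| matrix, compute Bellman backups only on a sparse random subset of entries, and project back to rank d in each iteration. The MDP, rank, discount value, and observation fraction below are illustrative assumptions, and the synthetic Q-function is only approximately low rank.

```python
import numpy as np

# Heuristic sketch of the iterative low-rank idea (not the paper's estimator):
# sparse Bellman backups followed by a rank-d truncation of the Q matrix.
rng = np.random.default_rng(3)
S, A, d, gamma = 30, 30, 2, 0.5                 # small discount, echoing the threshold condition

P = rng.dirichlet(np.ones(S), size=(S, A))      # synthetic transitions
R = rng.normal(size=(S, d)) @ rng.normal(size=(d, A))   # rank-d reward; Q* is only roughly low rank here

Q = np.zeros((S, A))
for _ in range(50):
    V = Q.max(axis=1)
    # 1) Bellman backups only on a sparse random subset of (state, action) entries
    mask = rng.random((S, A)) < 0.3
    target = R + gamma * (P @ V)
    Q_obs = np.where(mask, target, 0.0) / 0.3   # inverse-propensity rescaling
    # 2) rank-d projection of the sparse estimates via truncated SVD
    u, s, vt = np.linalg.svd(Q_obs, full_matrices=False)
    Q = u[:, :d] @ np.diag(s[:d]) @ vt[:d, :]

print("max Bellman residual:", np.abs(Q - (R + gamma * (P @ Q.max(axis=1)))).max())
```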