Using Contextual Bandits with Behavioral Constraints for Constrained Online Movie Recommendation

Avinash Balakrishnan, Djallel Bouneffouf, Nicholas Mattei, Francesca Rossi
AI systems that learn through reward feedback about the actions they take are increasingly deployed in domains that have significant impact on our daily life. In many cases the rewards should not be the only guiding criteria, as there are additional constraints and/or priorities imposed by regulations, values, preferences, or ethical principles. We detail a novel online system, based on an extension of the contextual bandits framework, that learns a set of behavioral constraints by observation… 
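As an illustration of the idea in the abstract, here is a minimal sketch (not the paper's exact system) of a contextual-bandit-style agent that learns a behavioral constraint by observing teacher-approved and teacher-forbidden choices, and blends that constraint score with reward estimates when acting. The class name, the epsilon-greedy policy, and the `lambda_` trade-off weight are assumptions made for this example.

```python
import numpy as np

class ConstrainedBandit:
    """Sketch: blend learned reward estimates with an observed approval model."""

    def __init__(self, n_arms, lambda_=0.5, epsilon=0.1):
        self.lambda_ = lambda_                # weight on reward vs. constraint
        self.epsilon = epsilon                # exploration rate
        self.reward_sum = np.zeros(n_arms)
        self.pulls = np.ones(n_arms)          # start at 1 to avoid divide-by-zero
        self.approvals = np.zeros(n_arms)     # teacher-approved choices observed
        self.observed = np.ones(n_arms)

    def observe_example(self, arm, approved):
        # Learn the behavioral constraint by watching demonstrations.
        self.observed[arm] += 1
        self.approvals[arm] += 1.0 if approved else 0.0

    def select(self, rng):
        # Epsilon-greedy over a convex combination of reward and approval rates.
        if rng.random() < self.epsilon:
            return int(rng.integers(len(self.pulls)))
        reward_est = self.reward_sum / self.pulls
        constraint_est = self.approvals / self.observed
        blended = self.lambda_ * reward_est + (1 - self.lambda_) * constraint_est
        return int(np.argmax(blended))

    def update(self, arm, reward):
        # Remain reactive to online reward feedback.
        self.pulls[arm] += 1
        self.reward_sum[arm] += reward
```

With a small `lambda_`, an arm that pays well but is consistently disapproved by the teacher is avoided, which is the qualitative behavior the abstract describes.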


Incorporating Behavioral Constraints in Online AI Systems
This work details a novel online agent that learns a set of behavioral constraints by observation and uses these learned constraints as a guide when making decisions in an online setting while still being reactive to reward feedback.
Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation
A hierarchical adaptive contextual bandit method (HATCH) is proposed to conduct the policy learning of contextual bandits with a budget constraint, and it is proved that HATCH achieves a regret bound as low as O(√T).
Contextual Bandit with Missing Rewards
Unlike standard contextual bandit methods, by leveraging clustering to estimate missing rewards, this work is able to learn from each incoming event, even those with missing rewards.
CPMetric: Deep Siamese Networks for Metric Learning on Structured Preferences
A recently proposed metric for CP-nets is leveraged, and a neural network architecture, CPMetric, is proposed to learn an approximation of the metric, showing how one can build a fast and flexible value alignment system.
Hyper-parameter Tuning for the Contextual Bandit
Two algorithms that use a bandit to find the optimal exploration rate of a contextual bandit algorithm are presented, which the authors hope is a first step toward the automation of multi-armed bandit algorithms.
A Deep Reinforcement Learning-Based Approach to Query-Free Interactive Target Item Retrieval
This work introduces an actor-critic framework to iteratively select sets of items based on real-time relevance feedback from users and their purchase history, thereby maximizing satisfaction with the entire session in the task of query-free interactive target item retrieval.
Joint Modeling of Local and Global Behavior Dynamics for Session-Based Recommendation
A Local-Global Session-based Recommendation framework (LGSR) is proposed which generalizes the modeling of behavior dynamics from two perspectives and enables the representation learning of user behavior dynamics via jointly mapping local and global signals into the same latent space.
Recommending Movies on User's Current Preferences via Deep Neural Network
This work proposed, developed, and evaluated a recommendation engine based on users' current preferences, using deep neural networks for cold-start scenarios, resulting in accurate and personalized movie recommendations.
Learning Behavioral Soft Constraints from Demonstrations
A novel inverse reinforcement learning (IRL) method, Max Entropy Inverse Soft Constraint IRL (MESC-IRL), for learning implicit hard and soft constraints over states, actions, and state features from demonstrations in deterministic and non-deterministic environments modeled as Markov Decision Processes (MDPs).
Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration
Inverse reinforcement learning is used to learn constraints, which are then combined with a possibly orthogonal value function through the use of a contextual-bandit-based orchestrator that picks a contextually appropriate choice between the two policies (constraint-based and environment-reward-based) when taking actions.


Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits
This is the first work that shows how to achieve logarithmic regret in constrained contextual bandits, and it sheds light on the study of computationally efficient algorithms for general constrained contextual bandits.
Contextual Bandits with Linear Payoff Functions
An O(√(Td ln³(KT ln(T)/δ))) regret bound is proved that holds with probability 1 − δ for the simplest known upper confidence bound algorithm for this problem.
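To make the linear-payoff setting concrete, the following is a hedged sketch of a LinUCB-style upper-confidence-bound algorithm for contextual bandits with linear payoffs. The class name, the fixed `alpha` exploration parameter, and the per-arm design matrices are assumptions for this illustration, not details taken from the paper.

```python
import numpy as np

class LinUCB:
    """Sketch of a UCB algorithm for contextual bandits with linear payoffs."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # d x d design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # accumulated reward vectors

    def select(self, contexts):
        # contexts: one d-dimensional feature vector per arm.
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]          # ridge-regression payoff estimate
            # Mean estimate plus a confidence-width exploration bonus.
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

After enough observations, the ridge estimate dominates the shrinking confidence bonus, so the algorithm concentrates on the arm with the highest estimated linear payoff.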
Motivated Value Selection for Artificial Agents
The conditions under which motivated value selection is an issue for some types of agents are established, and an example of an 'indifferent' agent that avoids it entirely is presented, posing and solving an issue that had not previously been formally addressed in the literature.
The MovieLens Datasets: History and Context
The history of MovieLens and the MovieLens datasets is documented, including a discussion of lessons learned from running a long-standing, live research platform from the perspective of a research organization, and best practices and limitations of using the MovieLens datasets in new research are documented.
Reinforcement Learning: An Introduction
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges.
  • S. Villar, J. Bowden, J. Wason
  • Computer Science
    Statistical Science: A Review Journal of the Institute of Mathematical Statistics
  • 2015
A novel bandit-based patient allocation rule is proposed that overcomes the issue of low power, thus removing a potential barrier to the use of bandit approaches in practice, while indicating that such approaches offer significant advantages but also face severe limitations in terms of their resulting statistical power.
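A standard bandit-based allocation rule of the kind surveyed here can be sketched with Bernoulli Thompson sampling; this is an illustrative example of the general technique, not necessarily the specific rule Villar et al. propose. The function name and the uniform Beta(1, 1) prior are assumptions.

```python
import numpy as np

def allocate_patient(successes, failures, rng):
    """Assign the next patient via Thompson sampling over Beta posteriors.

    successes[i]/failures[i] count observed outcomes for treatment i.
    """
    # Draw one plausible success probability per treatment from its
    # Beta posterior (uniform prior), then pick the best-sampled treatment.
    samples = [rng.beta(s + 1, f + 1) for s, f in zip(successes, failures)]
    return int(np.argmax(samples))
```

Because sampling favors treatments with better observed outcomes while still occasionally trying the others, the rule skews allocation toward the apparently superior treatment — the source of both the ethical appeal and the statistical-power concerns discussed in the paper.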
Playing Atari with Deep Reinforcement Learning
This work presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning, which outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
Machine learning
Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
"Sorry, I Can't Do That": Developing Mechanisms to Appropriately Reject Directives in Human-Robot Interactions
Initial work that has been done in the DIARC/ADE cognitive robotic architecture to enable a directive rejection and explanation mechanism is presented, showing its operation in a simple HRI scenario.
Advances in Applied Mathematics