Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

Ryuichi Takanobu, Hanlin Zhu and Minlie Huang
Dialog policy decides what and how a task-oriented dialog system will respond. The proposed approach estimates the reward signal and infers the user goal in the dialog sessions: the reward estimator evaluates state-action pairs so that it can guide the dialog policy at each dialog turn. Extensive experiments on a multi-domain dialog dataset show that the dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.
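The core idea — a reward estimator that scores state-action pairs and hands the policy a turn-level reward — can be sketched as an adversarial-IRL-style discriminator. Everything below (the feature vectors, the logistic scorer, the log D − log(1 − D) reward) is an illustrative toy, not the paper's actual architecture:

```python
import math
import random

random.seed(0)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class RewardEstimator:
    """Logistic discriminator over (state, action) feature vectors.

    Trained to score expert pairs high and policy-generated pairs low;
    the per-turn reward handed to the dialog policy is log D - log(1 - D),
    in the style of adversarial inverse RL.
    """

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def prob_expert(self, sa):
        return sigmoid(dot(self.w, sa))

    def reward(self, sa):
        # Clamp to avoid log(0) on confident predictions.
        d = min(max(self.prob_expert(sa), 1e-6), 1 - 1e-6)
        return math.log(d) - math.log(1 - d)

    def update(self, expert_sa, policy_sa):
        # One gradient step on the binary cross-entropy:
        # push D(expert) toward 1 and D(policy) toward 0.
        for sa, label in ((expert_sa, 1.0), (policy_sa, 0.0)):
            err = label - self.prob_expert(sa)
            for i, xi in enumerate(sa):
                self.w[i] += self.lr * err * xi

# Toy demo: expert pairs cluster around +1 features, policy pairs around -1.
est = RewardEstimator(dim=4)
for _ in range(200):
    expert = [1.0 + random.gauss(0, 0.3) for _ in range(4)]
    policy = [-1.0 + random.gauss(0, 0.3) for _ in range(4)]
    est.update(expert, policy)

print(est.reward([1.0] * 4) > est.reward([-1.0] * 4))  # expert-like pairs score higher
```

A real system would replace the hand-made feature clusters with encoded dialog states and actions, and interleave discriminator updates with policy-gradient updates.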


Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition

This work proposes Multi-Agent Dialog Policy Learning, which regards both the system and the user as the dialog agents and uses the actor-critic framework to facilitate pretraining and improve scalability.

Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation

This work proposes a novel reward learning approach for semi-supervised policy learning that learns a dynamics model as the reward function, modeling dialogue progress from expert demonstrations, and outperforms competitive policy learning baselines on MultiWOZ, a benchmark multi-domain dataset.

"Think Before You Speak": Improving Multi-Action Dialog Policy by Planning Single-Action Dialogs

The proposed PEDP method employs model-based planning to conceive what to express before deciding the current response by simulating single-action dialogs, and achieves a solid task success rate of 90.6%, a 3% improvement over state-of-the-art methods.

Imperfect also Deserves Reward: Multi-Level and Sequential Reward Modeling for Better Dialog Management

A multi-level reward modeling approach that factorizes a reward into a three-level hierarchy: domain, act, and slot is proposed that can provide more accurate and explainable reward signals for state-action pairs in dialogs.
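The domain/act/slot factorization can be illustrated with a toy scoring function. The level weights and the Jaccard-style slot credit below are assumptions for the sketch, not the paper's learned reward model:

```python
def multilevel_reward(pred, gold, weights=(0.2, 0.3, 0.5)):
    """Score a predicted system action against an annotated one at three
    levels of a domain/act/slot hierarchy.

    `pred` and `gold` are (domain, act, slot_set) triples. Act credit is
    only granted when the domain also matches; slot credit is the Jaccard
    overlap of the slot sets. The weights are illustrative.
    """
    w_dom, w_act, w_slot = weights
    r = w_dom * (pred[0] == gold[0])
    r += w_act * (pred[0] == gold[0] and pred[1] == gold[1])
    inter = len(set(pred[2]) & set(gold[2]))
    union = len(set(pred[2]) | set(gold[2])) or 1
    r += w_slot * inter / union
    return r

gold = ("hotel", "inform", {"area", "price"})
print(round(multilevel_reward(("hotel", "inform", {"area", "price"}), gold), 2))  # 1.0
print(round(multilevel_reward(("hotel", "request", {"area"}), gold), 2))          # 0.45
```

The point of the factorization is that a partially correct action (right domain, wrong act, one right slot) earns partial credit instead of a flat failure signal.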

Integrating Pretrained Language Model for Dialogue Policy Learning

This work decomposes the adversarial training into two steps: a pre-trained language model is integrated as a discriminator to judge whether the current system action is good enough for the last user action and an extra local dense reward is given to guide the agent’s exploration.

WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue

Empirical studies on two benchmarks indicate that the model significantly outperforms baselines in response quality and leads to successful conversations under both automatic evaluation and human judgment.

Learning Dialog Policies from Weak Demonstrations

This work leverages dialog data to guide the agent to successfully respond to a user’s requests, and introduces Reinforced Fine-tune Learning, an extension to DQfD, enabling it to overcome the domain gap between the datasets and the environment.

Variational Reward Estimator Bottleneck: Learning Robust Reward Estimator for Multi-Domain Task-Oriented Dialog

The Variational Reward Estimator Bottleneck (VRB) is proposed, an effective regularization method that constrains unproductive information flows between the inputs and the reward estimator.

Efficient Dialogue Complementary Policy Learning via Deep Q-network Policy and Episodic Memory Policy

A novel complementary policy learning (CPL) framework is proposed, which exploits the complementary advantages of the episodic memory (EM) policy and the deep Q-network (DQN) policy to achieve fast and effective dialogue policy learning.
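A minimal sketch of the complementary idea — prefer an episodic-memory return when the exact state-action pair has been visited, otherwise fall back to the parametric Q-function — assuming hashable states and a tabular memory. All names here are illustrative, not the paper's API:

```python
class ComplementaryPolicy:
    """Combine a parametric Q-function with an episodic-memory lookup.

    If a (state, action) pair has been seen before, reuse the best
    observed return for it; otherwise fall back to the Q-network estimate.
    """

    def __init__(self, q_func, actions):
        self.q_func = q_func   # (state, action) -> estimated value
        self.actions = actions
        self.memory = {}       # (state, action) -> best observed return

    def remember(self, state, action, ret):
        key = (state, action)
        self.memory[key] = max(ret, self.memory.get(key, float("-inf")))

    def value(self, state, action):
        return self.memory.get((state, action), self.q_func(state, action))

    def act(self, state):
        return max(self.actions, key=lambda a: self.value(state, a))

# Toy demo: an untrained Q-function returns 0 everywhere, but episodic
# memory recalls that "inform" paid off in this state before.
policy = ComplementaryPolicy(lambda s, a: 0.0, actions=["inform", "request"])
policy.remember(("greet",), "inform", 1.5)
print(policy.act(("greet",)))  # inform
```

The episodic memory gives fast, high-confidence decisions on familiar states, while the DQN generalizes to unseen ones; the max-return update keeps the memory optimistic.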

Adversarial Learning of Task-Oriented Neural Dialog Models

This work proposes an adversarial learning method for reward estimation in reinforcement learning (RL) based task-oriented dialog models and shows that the proposed adversarial dialog learning method achieves an advanced dialog success rate compared to strong baseline methods.

Temporal supervised learning for inferring a dialog policy from example conversations

This paper introduces a new algorithm called Temporal Supervised Learning which learns directly from example dialogs, while also taking proper account of planning, and shows that a dialog manager trained with temporal supervised learning substantially outperforms a baseline trained using conventional supervised learning.

Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning

This paper addresses the travel planning task by formulating the task in the mathematical framework of options over Markov Decision Processes (MDPs), and proposing a hierarchical deep reinforcement learning approach to learning a dialogue manager that operates at different temporal scales.

On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems

An on-line learning framework whereby the dialogue policy is jointly trained alongside the reward model via active learning with a Gaussian process model is proposed.

Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning

A novel agent-aware dropout Deep Q-Network (AAD-DQN) is proposed to address the problems of when to consult the teacher and how to learn from the teacher's experiences; it can significantly improve both the safety and efficiency of on-line policy optimization compared to other companion learning approaches.
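The "when to consult the teacher" decision can be sketched as an uncertainty test over multiple stochastic (dropout) forward passes: if the resulting Q-estimates disagree strongly, the agent is uncertain and should ask the teacher. The sample values and threshold below are illustrative, not the paper's settings:

```python
import statistics

def should_consult_teacher(q_samples, threshold=0.5):
    """Dropout-DQN-style uncertainty gate.

    `q_samples` are Q-value estimates for the same state-action pair from
    several stochastic forward passes; high spread signals low confidence,
    so the agent defers to the teacher. `threshold` is a hyperparameter.
    """
    return statistics.pstdev(q_samples) > threshold

print(should_consult_teacher([0.1, 2.3, -1.0]))    # high spread -> True
print(should_consult_teacher([1.0, 1.05, 0.98]))   # low spread -> False
```

In a full system the samples would come from repeated forward passes of the same network with dropout left active at inference time.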

Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning

This paper presents an end-to-end framework for task-oriented dialog systems using a variant of Deep Recurrent Q-Networks (DRQN). The model is able to interface with a relational database and jointly learn policies for both language understanding and dialogue strategy.

Policy Networks with Two-Stage Training for Dialogue Systems

This paper shows that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian Processes methods and shows that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently.

End-to-End Reinforcement Learning of Dialogue Agents for Information Access

This paper proposes KB-InfoBot, a multi-turn dialogue agent which helps users search Knowledge Bases (KBs) without composing complicated queries. Such goal-oriented dialogue agents typically need to interact with an external database to access real-world knowledge.

Feudal Reinforcement Learning for Dialogue Management in Large Domains

A novel Dialogue Management architecture based on Feudal RL is proposed, which decomposes the decision into two steps: a first step where a master policy selects a subset of primitive actions, and a second step where a primitive action is chosen from the selected subset.
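The two-step feudal decision can be sketched in a few lines, assuming a hypothetical master policy that routes between a slot-dependent and a general worker policy (the policy names and actions are illustrative, not the paper's):

```python
def feudal_select(state, master_policy, worker_policies):
    """Two-step feudal action selection: the master picks an action subset
    (identified here by a key), then the worker policy for that subset
    picks the primitive action."""
    subset = master_policy(state)
    return worker_policies[subset](state)

# Illustrative master: route to slot-dependent actions while a slot is unfilled.
master = lambda s: "slot_dependent" if not s.get("slot_filled") else "general"
workers = {
    "slot_dependent": lambda s: "request_area",
    "general": lambda s: "bye",
}
print(feudal_select({"slot_filled": False}, master, workers))  # request_area
print(feudal_select({"slot_filled": True}, master, workers))   # bye
```

The decomposition shrinks each policy's action space: the master chooses among a few subsets, and each worker only ranks the primitives within its own subset.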

Learning the Reward Model of Dialogue POMDPs from Data

A novel inverse reinforcement learning (IRL) algorithm is introduced for learning the reward function of a dialogue POMDP model; the resulting performance is shown to be higher than expert performance at no-, low-, and medium-noise levels, but not at the high noise level.