Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

@inproceedings{Takanobu2019GuidedDP,
  title={Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog},
  author={Ryuichi Takanobu and Hanlin Zhu and Minlie Huang},
  booktitle={EMNLP},
  year={2019}
}
Dialog policy decides what and how a task-oriented dialog system will respond. Key Method: The proposed approach estimates the reward signal and infers the user goal from the dialog sessions. The reward estimator evaluates state-action pairs so that it can guide the dialog policy at each dialog turn. Extensive experiments on a multi-domain dialog dataset show that the dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.
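The core idea of evaluating state-action pairs with a learned reward estimator can be sketched as a simple discriminator trained to separate expert dialog turns from policy-generated ones, in the spirit of adversarial inverse RL. This is a minimal illustrative sketch, not the paper's actual implementation: all dimensions, the logistic model, and the toy data are assumptions.

```python
# Hedged sketch: a logistic reward estimator over (state, action) pairs.
# Expert pairs are labeled 1, policy-generated pairs 0; the resulting
# log-odds score serves as a per-turn reward for the dialog policy.
# Dimensions and data below are toy assumptions, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2

def features(state, action):
    """Concatenate state and one-hot action into one feature vector."""
    return np.concatenate([state, action])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RewardEstimator:
    """Discriminator that scores how expert-like a (s, a) pair looks."""
    def __init__(self, dim, lr=0.5):
        self.w = np.zeros(dim)
        self.lr = lr

    def score(self, s, a):
        return sigmoid(self.w @ features(s, a))

    def reward(self, s, a):
        # log D - log(1 - D): positive when the pair looks expert-like.
        d = np.clip(self.score(s, a), 1e-6, 1 - 1e-6)
        return np.log(d) - np.log(1.0 - d)

    def update(self, expert_pairs, policy_pairs):
        # One gradient ascent step on the binary cross-entropy objective.
        grad = np.zeros_like(self.w)
        for s, a in expert_pairs:
            grad += (1.0 - self.score(s, a)) * features(s, a)
        for s, a in policy_pairs:
            grad -= self.score(s, a) * features(s, a)
        self.w += self.lr * grad / (len(expert_pairs) + len(policy_pairs))

# Toy data: the "expert" prefers action [1, 0] in high-valued states.
expert = [(rng.uniform(0.5, 1.0, STATE_DIM), np.array([1.0, 0.0]))
          for _ in range(50)]
policy = [(rng.uniform(0.0, 0.5, STATE_DIM), np.array([0.0, 1.0]))
          for _ in range(50)]

est = RewardEstimator(STATE_DIM + ACTION_DIM)
for _ in range(200):
    est.update(expert, policy)

# After training, expert-like actions earn higher turn-level reward,
# which is the signal that would guide the dialog policy's updates.
s_hi = np.full(STATE_DIM, 0.9)
print(est.reward(s_hi, np.array([1.0, 0.0])),
      est.reward(s_hi, np.array([0.0, 1.0])))
```

In the paper's setting the state and action come from the dialog state tracker and system act, and the estimator is trained jointly with the policy; the sketch above only shows the scoring mechanism.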


Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition
TLDR
This work proposes Multi-Agent Dialog Policy Learning, which regards both the system and the user as the dialog agents and uses the actor-critic framework to facilitate pretraining and improve scalability.
Semi-Supervised Dialogue Policy Learning via Stochastic Reward Estimation
TLDR
This work proposes a novel reward learning approach for semi-supervised policy learning that outperforms competitive policy learning baselines on MultiWOZ, a benchmark multi-domain dataset and learns a dynamics model as the reward function which models dialogue progress based on expert demonstrations.
Imperfect also Deserves Reward: Multi-Level and Sequential Reward Modeling for Better Dialog Management
TLDR
A multi-level reward modeling approach that factorizes a reward into a three-level hierarchy: domain, act, and slot is proposed that can provide more accurate and explainable reward signals for state-action pairs in dialogs.
Integrating Pretrained Language Model for Dialogue Policy Learning
TLDR
This work decomposes the adversarial training into two steps: a pre-trained language model is integrated as a discriminator to judge whether the current system action is good enough for the last user action and an extra local dense reward is given to guide the agent’s exploration.
Guided Dialogue Policy Learning without Adversarial Learning in the Loop
TLDR
The proposed method decomposes the adversarial training into two steps; it achieves a remarkable task success rate with both on-policy and off-policy reinforcement learning methods and has the potential to transfer knowledge from existing domains to a new domain.
WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue
TLDR
Empirical studies with two benchmarks indicate that the model significantly improves response quality and leads to successful conversations under both automatic evaluation and human judgment.
Learning Dialog Policies from Weak Demonstrations
TLDR
This work leverages dialog data to guide the agent to successfully respond to a user’s requests, and introduces Reinforced Fine-tune Learning, an extension to DQfD, enabling it to overcome the domain gap between the datasets and the environment.
Variational Reward Estimator Bottleneck: Learning Robust Reward Estimator for Multi-Domain Task-Oriented Dialog
TLDR
The Variational Reward estimator Bottleneck (VRB) is proposed, which is an effective regularization method that aims to constrain unproductive information flows between inputs and the reward estimator.
Efficient Dialogue Complementary Policy Learning via Deep Q-network Policy and Episodic Memory Policy
TLDR
A novel complementary policy learning (CPL) framework is proposed, which exploits the complementary advantages of the episodic memory (EM) policy and the deep Q-network (DQN) policy to achieve fast and effective dialogue policy learning.
...

References

SHOWING 1-10 OF 38 REFERENCES
Adversarial Learning of Task-Oriented Neural Dialog Models
TLDR
This work proposes an adversarial learning method for reward estimation in reinforcement learning (RL) based task-oriented dialog models and shows that the proposed adversarial dialog learning method achieves a higher dialog success rate compared to strong baseline methods.
Temporal supervised learning for inferring a dialog policy from example conversations
TLDR
This paper introduces a new algorithm called Temporal Supervised Learning which learns directly from example dialogs, while also taking proper account of planning, and shows that a dialog manager trained with temporal supervised learning substantially outperforms a baseline trained using conventional supervised learning.
Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning
TLDR
This paper addresses the travel planning task by formulating the task in the mathematical framework of options over Markov Decision Processes (MDPs), and proposing a hierarchical deep reinforcement learning approach to learning a dialogue manager that operates at different temporal scales.
On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems
TLDR
An on-line learning framework whereby the dialogue policy is jointly trained alongside the reward model via active learning with a Gaussian process model is proposed.
Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning
TLDR
A novel agent-aware dropout Deep Q-Network (AAD-DQN) is proposed to address the problem of when to consult the teacher and how to learn from the teacher’s experiences and can significantly improve both safety and efficiency of on-line policy optimization compared to other companion learning approaches.
Policy Networks with Two-Stage Training for Dialogue Systems
TLDR
This paper shows that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian process methods and that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently.
End-to-End Reinforcement Learning of Dialogue Agents for Information Access
This paper proposes KB-InfoBot -- a multi-turn dialogue agent which helps users search Knowledge Bases (KBs) without composing complicated queries. Such goal-oriented dialogue agents typically need to interact with an external database to access real-world knowledge.
Scaling up deep reinforcement learning for multi-domain dialogue systems
TLDR
Experimental results comparing DQN and NDQN in simulation show that the proposed method exhibits better scalability and is promising for optimising the behaviour of multi-domain dialogue systems.
Feudal Reinforcement Learning for Dialogue Management in Large Domains
TLDR
A novel Dialogue Management architecture, based on Feudal RL, is proposed, which decomposes the decision into two steps: a first step where a master policy selects a subset of primitive actions, and a second step where a primitive action is chosen from the selected subset.
Learning the Reward Model of Dialogue POMDPs from Data
TLDR
A novel inverse reinforcement learning (IRL) algorithm is introduced for learning the reward function of the dialogue POMDP model, and the learned policy is shown to exceed expert performance at the non-, low-, and medium-noise levels, though not at the high noise level.
...