Efficient Dialog Policy Learning via Positive Memory Retention

@article{Zhao2018EfficientDP,
  title={Efficient Dialog Policy Learning via Positive Memory Retention},
  author={Rui Zhao and Volker Tresp},
  journal={2018 IEEE Spoken Language Technology Workshop (SLT)},
  year={2018},
  pages={823-830}
}
This paper is concerned with training recurrent neural networks as goal-oriented dialog agents using reinforcement learning. Training such agents with policy gradients typically requires a large number of samples. However, collecting the required data in the form of conversations between chatbots and human agents is time-consuming and expensive. To mitigate this problem, we describe an efficient policy gradient method using positive memory retention, which significantly increases the…
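
To make the idea concrete, here is a minimal Python sketch (not the authors' code) of the retention-and-replay pattern the abstract describes: episodes with high reward are stored and later replayed with clipped importance-sampling weights that correct for the mismatch between the recording policy and the current one. The toy task, the stateless categorical policy, the replay window of three episodes, and the clipping threshold of 5.0 are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N_ACTIONS, HORIZON, LR = 4, 8, 0.1
    theta = np.zeros(N_ACTIONS)                 # logits of a toy stateless policy

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def reinforce_grad(actions, reward, probs):
        # REINFORCE gradient for a categorical policy: sum_t (onehot(a_t) - probs) * R
        g = np.zeros(N_ACTIONS)
        for a in actions:
            g += np.eye(N_ACTIONS)[a] - probs
        return g * reward

    memory = []                                 # retained positive episodes
    for step in range(300):
        probs = softmax(theta)
        actions = rng.choice(N_ACTIONS, HORIZON, p=probs)
        reward = float((actions == 0).sum())    # hypothetical goal: pick action 0
        theta += LR * reinforce_grad(actions, reward, probs) / HORIZON
        if reward > HORIZON / 2:                # retain only successful episodes
            memory.append((actions, np.log(probs[actions]).sum(), reward))
        for mem_a, old_logp, mem_r in memory[-3:]:   # replay recent positive memories
            probs = softmax(theta)
            ratio = np.exp(np.log(probs[mem_a]).sum() - old_logp)  # pi_new / pi_old
            w = min(ratio, 5.0)                 # clip the importance weight
            theta += LR * w * reinforce_grad(mem_a, mem_r, probs) / HORIZON

    print("learned action probabilities:", softmax(theta).round(3))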

Citations

Learning Goal-Oriented Visual Dialog via Tempered Policy Gradient

TLDR
A class of novel temperature-based extensions for policy gradient methods, referred to as Tempered Policy Gradients (TPGs), is proposed; they improve the performance of commonly used policy-based dialogue agents by around 5% and help produce more convincing utterances.
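
As a rough illustration of the temperature mechanism, the hypothetical snippet below samples actions from a tempered softmax: a temperature above 1 flattens the distribution for exploration, below 1 sharpens it for exploitation. In TPG the gradient is still estimated for the untempered policy; only the sampling step is shown here, and the logits and temperature values are made up.

    import numpy as np

    rng = np.random.default_rng(0)

    def tempered_sample(logits, tau):
        # tau > 1 flattens the softmax (more exploration); tau < 1 sharpens it
        # (more exploitation); tau = 1 recovers ordinary on-policy sampling.
        z = logits / tau
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        return rng.choice(len(logits), p=probs), probs

    logits = np.array([2.0, 1.0, 0.1])           # hypothetical action scores
    _, hot = tempered_sample(logits, 1.5)        # flattened distribution
    _, cold = tempered_sample(logits, 0.5)       # sharpened distribution
    print("tau=1.5:", hot.round(3), " tau=0.5:", cold.round(3))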

Curiosity-Driven Experience Prioritization via Density Estimation

TLDR
A novel Curiosity-Driven Prioritization (CDP) framework encourages the agent to over-sample trajectories with rare achieved goal states; experimental results show that CDP improves both the performance and the sample efficiency of reinforcement learning agents compared to state-of-the-art methods.
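
A hedged sketch of the prioritization idea: fit a density model over the achieved goal states and replay trajectories whose goals have low density more often. The Gaussian mixture below is a stand-in density estimator, and the data, component count, and batch size are illustrative, not taken from the paper.

    import numpy as np
    from sklearn.mixture import GaussianMixture       # stand-in density estimator

    rng = np.random.default_rng(0)
    achieved_goals = rng.normal(size=(500, 3))        # hypothetical achieved goal states
    density = GaussianMixture(n_components=3, random_state=0).fit(achieved_goals)

    log_dens = density.score_samples(achieved_goals)  # log-density of each episode's goal
    priority = log_dens.max() - log_dens              # rare (low-density) goals rank high
    probs = (priority + 1e-6) / (priority + 1e-6).sum()
    batch = rng.choice(len(achieved_goals), size=64, p=probs)  # prioritized replay batch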

Mutual Information-based State-Control for Intrinsically Motivated Reinforcement Learning

TLDR
This work proposes to formulate an intrinsic objective as the mutual information between the goal states and the controllable states, which encourages the agent to take control of its environment.

Guessing State Tracking for Visual Dialogue

TLDR
A guessing-state-tracking-based guess model for the Guesser significantly outperforms previous models and achieves a new state of the art; its guessing success rate of 83.3% approaches the human-level accuracy of 84.4%.

Related Work to Neural Natural-Language Template Matching

TLDR
A novel method is proposed that learns to match a natural-language template to utterances, extracting information from each utterance in the process; the approach is contrasted with existing work in neural semantic parsing and sentence matching.

AutoScale: Energy Efficiency Optimization for Stochastic Edge Inference Using Reinforcement Learning

TLDR
This paper proposes AutoScale, an adaptive and lightweight execution-scaling engine built on a custom-designed reinforcement learning algorithm, which continuously learns and selects the most energy-efficient inference execution target by considering the characteristics of the neural networks and the available systems in a collaborative cloud-edge execution environment, while adapting to stochastic runtime variance.

Learning Individualized Treatment Rules with Estimated Translated Inverse Propensity Score

TLDR
This paper focuses on learning individualized treatment rules (ITRs) to derive a treatment policy that is expected to generate a better outcome for an individual patient; ITR learning is cast as a contextual bandit problem that minimizes the expected risk of the treatment policy.

Maximum Entropy-Regularized Multi-Goal Reinforcement Learning

TLDR
A novel multi-goal RL objective based on weighted entropy is proposed, which encourages the agent to maximize the expected return as well as to achieve more diverse goals; a maximum entropy-based prioritization framework is developed to optimize the proposed objective.

Energy-Based Hindsight Experience Prioritization

TLDR
An energy-based framework for prioritizing hindsight experience in robotic manipulation tasks, inspired by the work-energy principle in physics, which hypothesizes that replaying episodes with high trajectory energy is more effective for reinforcement learning in robotics.
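
The sketch below illustrates one plausible reading of the prioritization rule: score each episode by the total mechanical energy (potential plus kinetic) of an object trajectory and sample replay episodes with probability proportional to that energy. The trajectories, the mass, and the treatment of the z coordinate as height are assumptions for illustration.

    import numpy as np

    def trajectory_energy(pos, vel, mass=1.0, g=9.81):
        # total mechanical energy over the episode: potential m*g*h
        # (taking the z coordinate as height) plus kinetic 0.5*m*|v|^2
        potential = mass * g * pos[:, 2]
        kinetic = 0.5 * mass * (vel ** 2).sum(axis=1)
        return float((potential + kinetic).sum())

    rng = np.random.default_rng(0)
    # hypothetical object trajectories: (positions, velocities) per episode
    episodes = [(rng.uniform(size=(50, 3)), rng.normal(size=(50, 3)))
                for _ in range(100)]
    energy = np.array([trajectory_energy(p, v) for p, v in episodes])
    probs = energy - energy.min() + 1e-6
    probs /= probs.sum()
    replay = rng.choice(len(episodes), size=16, p=probs)  # energy-proportional replay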

References

SHOWING 1-10 OF 56 REFERENCES

Sample-efficient Deep Reinforcement Learning for Dialog Control

TLDR
This paper presents three methods for reducing the number of dialogs required to optimize an RNN-based dialog policy with RL, including maintaining a second RNN that predicts the value of the current policy and applying experience replay to both networks.

Learning Goal-Oriented Visual Dialog via Tempered Policy Gradient

TLDR
A class of novel temperature-based extensions for policy gradient methods, referred to as Tempered Policy Gradients (TPGs), is proposed; they improve the performance of commonly used policy-based dialogue agents by around 5% and help produce more convincing utterances.

Efficient Exploration for Dialogue Policy Learning with BBQ Networks & Replay Buffer Spiking

TLDR
This work introduces an exploration technique based on Thompson sampling, drawing Monte Carlo samples from a Bayes-by-backprop neural network, demonstrating marked improvement over common approaches such as ε-greedy and Boltzmann exploration.
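
As a loose illustration of Thompson sampling with a learned weight posterior, the snippet below keeps a factorized Gaussian over the weights of a linear Q-function (a stand-in for the Bayes-by-backprop network), draws one Monte Carlo weight sample per decision, and acts greedily with respect to the sampled Q-values. The dimensions and the fixed posterior parameters are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_actions = 8, 4
    # factorized Gaussian posterior over the weights of a linear Q-function,
    # standing in for the Bayes-by-backprop network described in the paper
    mu = np.zeros((n_features, n_actions))
    rho = np.full((n_features, n_actions), -3.0)     # sigma = log(1 + exp(rho)) > 0

    def thompson_action(state):
        sigma = np.log1p(np.exp(rho))
        w = mu + sigma * rng.normal(size=mu.shape)   # one Monte Carlo weight sample
        q = state @ w                                # Q-values under sampled weights
        return int(np.argmax(q))                     # act greedily w.r.t. the sample

    state = rng.normal(size=n_features)
    print("sampled action:", thompson_action(state))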

Efficient Exploration for Dialog Policy Learning with Deep BBQ Networks & Replay Buffer Spiking

TLDR
This work introduces an exploration technique based on Thompson sampling, drawing Monte Carlo samples from a Bayes-by-backprop neural network, demonstrating marked improvement over common approaches such as ε-greedy and Boltzmann exploration.

Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management

TLDR
A practical approach to learning deep RL-based dialogue policies is presented, and their effectiveness is demonstrated in a task-oriented information-seeking domain.

BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems

TLDR
A new algorithm is presented that significantly improves the efficiency of exploration for deep Q-learning agents in dialogue systems and shows that spiking the replay buffer with experiences from just a few successful episodes can make Q-learning feasible when it might otherwise fail.

Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning

TLDR
A novel agent-aware dropout Deep Q-Network (AAD-DQN) is proposed to address the problem of when to consult the teacher and how to learn from the teacher’s experiences and can significantly improve both safety and efficiency of on-line policy optimization compared to other companion learning approaches.

The Reactor: A Sample-Efficient Actor-Critic Architecture

TLDR
A new reinforcement learning agent, called Reactor (for Retrace-actor), based on an off-policy multi-step return actor-critic architecture that is sample-efficient thanks to the use of memory replay and numerically efficient since it uses multi-step returns.

Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning

This paper presents an end-to-end framework for task-oriented dialog systems using a variant of Deep Recurrent Q-Networks (DRQN). The model is able to interface with a relational database and jointly…

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

TLDR
This work poses a cooperative ‘image guessing’ game between two agents who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images, and shows the emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision.
...