Corpus ID: 231986109

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Haoming Jiang, Bo Dai, Mengjiao Yang, Wei Wei, Tuo Zhao
Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large scale experiments. Though researchers have attempted to use metrics (e.g., perplexity, BLEU) in language generation tasks or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation…
A Free Lunch from the Noise: Provable and Practical Exploration for Representation Learning
  • Tongzheng Ren, Tianjun Zhang, Csaba Szepesvári, Bo Dai
  • Computer Science, Mathematics
  • ArXiv
  • 2021
Representation learning lies at the heart of the empirical success of deep learning for dealing with the curse of dimensionality. However, the power of representation learning has not been fully…
TRAIL: Near-Optimal Imitation Learning with Suboptimal Data
This work presents training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space and proposes TRAIL, an algorithm that learns an energy-based transition model contrastively, and uses the transition model to reparametrize the action space for sample-efficient imitation learning.
User Response and Sentiment Prediction for Automatic Dialogue Evaluation
Automatic evaluation is beneficial for open-domain dialog system development. However, standard word-overlap metrics (BLEU, ROUGE) do not correlate well with human judgements of open-domain dialog…


Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
ADEM, an evaluation model that learns to predict human-like scores for input responses using a new dataset of human response scores. Its predictions are shown to correlate significantly with human judgements at both the utterance and system level, and at a level much higher than word-overlap metrics such as BLEU.
Toward Learning and Evaluation of Dialogue Policies with Text Examples
A dialogue collection and enrichment framework designed to explore the learning and evaluation of dialogue policies for simple conversational characters using textual training data, together with an automatic policy evaluation metric that recognizes the validity of multiple conversational responses at each point in a dialogue.
RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems
RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance), and has a high correlation with human annotation.
uBLEU: Uncertainty-Aware Automatic Evaluation Method for Open-Domain Dialogue Systems
A fully automatic, uncertainty-aware evaluation method for open-domain dialogue systems, υBLEU, which first collects diverse reference responses from massive dialogue data and then annotates them with quality judgments using a neural network trained on automatically collected training data.
AirDialogue: An Environment for Goal-Oriented Dialogue Research
AirDialogue is presented, a large dataset that contains 301,427 goal-oriented conversations. Its experimental results indicate that state-of-the-art dialogue models can only achieve a score of 0.17 while humans can reach a score of 0.91, which suggests significant opportunities for future improvement.
PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems
This paper proposes PONE, a novel and feasible learning-based metric that can significantly improve the correlation with human judgments by using augmented POsitive samples and valuable NEgative samples.
Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems, identifying their shortcomings while accumulating evidence towards the most effective evaluation dimensions.
Evaluating Coherence in Dialogue Systems using Entailment
Results show that interpretable metrics for evaluating topic coherence, built on distributed sentence representations, can be used as a surrogate for human judgment, making it easy to evaluate dialogue systems on large-scale datasets and allowing an unbiased estimate of the quality of the responses.
Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems
It is shown that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r>.7, p<.05).
Deep Reinforcement Learning for Dialogue Generation
This work simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (non-repetitive turns), coherence, and ease of answering.