Corpus ID: 221266021

How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics

Prasanna Parthasarathi, Joelle Pineau, Sarath Chandar
Though generative dialogue modeling is widely seen as a language modeling task, the task demands that an agent have a complex natural language understanding of its input text to carry on a meaningful interaction with a user. The automatic metrics in use evaluate the quality of the generated text as a proxy for the holistic interaction of the agent. Such metrics were earlier shown not to correlate with human judgement. In this work, we observe that human evaluation of dialogue agents can be…
Sometimes We Want Translationese
This paper proposes a simple, novel way to quantify whether an NMT system exhibits robustness and faithfulness, focusing on the case of word-order perturbations, and explores a suite of functions to perturb the word order of source sentences without deleting or injecting tokens.
A Survey of Evaluation Metrics Used for NLG Systems
A coherent taxonomy of the evaluation metrics is provided to organize the existing metrics and to better understand the developments in the field to improve the automatic evaluation metrics.


Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
An evaluation model (ADEM) learns to predict human-like scores for input responses, using a new dataset of human response scores. Its predictions correlate significantly with human judgements at both the utterance and system level, and at a level much higher than word-overlap metrics such as BLEU.
How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
This work investigates evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available and shows that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain.
ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons
A novel procedure compares two full dialogues: a human judge is asked to pay attention to only one speaker within each and to make a pairwise judgment, resulting in better tests.
Learning an Unreferenced Metric for Online Dialogue Evaluation
This work proposes an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances and leverages the temporal transitions that exist between them. The model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems
RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, evaluates a reply by taking into consideration both a groundtruth reply and a query (the previous user-issued utterance), and has a high correlation with human annotation.
Wizard of Wikipedia: Knowledge-Powered Conversational Agents
The best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while a new benchmark allows for measuring further improvements in this important research direction.
Frames: a corpus for adding memory to goal-oriented dialogue systems
The frame tracking task is proposed, which consists of keeping track of different semantic frames throughout each dialogue, together with a rule-based baseline through which the task is analysed.
CoQA: A Conversational Question Answering Challenge
CoQA, a novel dataset for building Conversational Question Answering systems, is introduced, and it is shown that conversational questions have challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning).
Incorporating Unstructured Textual Knowledge Sources into Neural Dialogue Systems
We present initial methods for incorporating unstructured external textual information into neural dialogue systems for predicting the next utterance of a user in a two-party chat conversation. The…
Deep Reinforcement Learning for Dialogue Generation
This work simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (non-repetitive turns), coherence, and ease of answering.