A Comprehensive Assessment of Dialog Evaluation Metrics

Yi-Ting Yeh, Maxine Eskénazi, Shikib Mehri
Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets, and there has not yet been a systematic comparison between them. To this end, this paper…

Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric

The proposed metric achieves state-of-the-art performance on the HUMOD dataset while reducing measured dataset sensitivity by 37%-66%, without fine-tuning a pretrained language model and using only 3,750 unannotated human dialogues and a single negative example.

FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows

This work develops FlowEval, the first consensus-based dialogue evaluation framework, which provides a reference-free approach to dialog evaluation by finding pseudo-references. It also proposes segment act, an extension of dialog act from the utterance level to the segment level, and crowdsources a large-scale dataset for it.

Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges

The workshop explored the current state of the art along with its limitations and suggested promising directions for future work in this important and very rapidly changing area of research.

Spurious Correlations in Reference-Free Evaluation of Text Generation

Evidence is found that reference-free evaluation metrics for summarization and dialog generation may be relying on spurious relationships with measures such as word overlap, perplexity, and length; it is also observed that for text summarization, these metrics have high error rates.
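The diagnostic the paper describes can be illustrated with a minimal sketch: check whether a metric's scores correlate with a confound such as response length. The scores and lengths below are made-up placeholders, not data from the paper.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical reference-free metric scores for five responses, alongside
# each response's length in tokens.
metric_scores = [0.2, 0.35, 0.5, 0.62, 0.8]
lengths = [2, 4, 6, 8, 12]

# A correlation this strong suggests the metric may be a length proxy.
r = pearson(metric_scores, lengths)
```

In practice one would run this check over a large sample of system outputs and against several confounds (length, word overlap with the context, perplexity) before trusting a reference-free metric.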

What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation

This work proposes to use information that can be automatically extracted from the next user utterance, such as its sentiment or whether the user explicitly ends the conversation, as a proxy to measure the quality of the previous system response.
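The proxy idea can be sketched as follows. The paper extracts sentiment automatically with a trained model; the tiny hand-made lexicon here is only a stand-in to show the shape of the signal.

```python
# Toy sentiment lexicon -- a placeholder, not the paper's sentiment model.
POSITIVE = {"thanks", "great", "perfect", "helpful"}
NEGATIVE = {"wrong", "useless", "no", "terrible"}

def sentiment_proxy(next_user_utterance):
    """Score the *previous* system response by the sentiment of the user's
    next turn. Explicit conversation endings could be handled the same way
    (e.g. an abrupt 'bye' treated as a negative signal)."""
    toks = set(next_user_utterance.lower().split())
    return len(toks & POSITIVE) - len(toks & NEGATIVE)
```

A positive next turn ("thanks that was helpful") yields a positive score for the preceding system response, while a complaint ("no that is wrong") yields a negative one.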

MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation

The proposed MDD-Eval framework first trains a teacher evaluator on human-annotated data to acquire the skill of telling good dialogue responses from bad ones in a particular domain, and then adopts a self-training strategy to train a new evaluator on teacher-annotated multi-domain data, which helps the new evaluator generalize across multiple domains.

A Review of Quality Assurance Research of Dialogue Systems

With the development of machine learning and big data technology, dialogue systems have been applied to many fields, including aerospace, banking and other scenarios that require high accuracy of

Distribution Aware Metrics for Conditional Natural Language Generation

This work proposes a novel paradigm for multi-candidate evaluation of conditional language generation models, and a new family of metrics that compare the distributions of reference and model-generated caption sets using small sample sets of each.

Open-Domain Dialog Evaluation using Follow-Ups Likelihood

This paper presents a new automated evaluation method using follow-ups: it measures the probability that a language model will continue the conversation with a set of follow-ups, and achieves the highest correlation with human evaluations.
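The mechanism can be sketched with a toy likelihood function. The actual method scores a fixed set of follow-up utterances with a pretrained language model conditioned on the dialog; here both the follow-up set and the "model" (which just rewards tokens echoing the context) are illustrative stand-ins.

```python
import math

# Hypothetical follow-up set; the real method uses fixed positive
# follow-up utterances scored by a pretrained LM.
FOLLOW_UPS = [
    "that is interesting tell me more",
    "good point i agree",
]

def toy_log_prob(context, continuation):
    """Stand-in for a pretrained LM: tokens that echo the context are
    deemed likelier, crudely mimicking conditional likelihood."""
    ctx = set(context.split())
    return sum(math.log(0.2 if tok in ctx else 0.01)
               for tok in continuation.split())

def follow_up_score(context):
    """Average log-likelihood assigned to the follow-up set; a higher
    score suggests the dialog invites a natural continuation."""
    return sum(toy_log_prob(context, f) for f in FOLLOW_UPS) / len(FOLLOW_UPS)
```

A dialog whose wording supports the follow-ups ("i agree that is a good point") scores higher than an unrelated one, which is the ranking behavior the metric relies on.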

SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation

Experimental results on multiple benchmarks show that SelF-Eval is highly consistent with human evaluations and better than the state-of-the-art models.

Unsupervised Evaluation of Interactive Dialog with DialoGPT

The FED metric (fine-grained evaluation of dialog), an automatic evaluation metric that uses DialoGPT without any fine-tuning or supervision, is introduced; it attains moderate to strong correlation with human judgement at both the turn and dialog levels.

PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems

This paper proposes PONE, a novel and feasible learning-based metric that significantly improves correlation with human judgments by using augmented POsitive samples and valuable NEgative samples.

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog and is shown to strongly correlate with human judgment on both Topical-Chat and PersonaChat.

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

An evaluation model (ADEM) is presented that learns to predict human-like scores for input responses, using a new dataset of human response scores; the ADEM model's predictions are shown to correlate with human judgements significantly, and at a level much higher than word-overlap metrics such as BLEU, at both the utterance and system level.

Assessing Dialogue Systems with Distribution Distances

This paper proposes to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations, and develops and evaluates two distribution-wise metrics.
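One common distribution-wise distance is Maximum Mean Discrepancy (MMD); the sketch below computes a biased MMD² estimate between two sets of feature vectors. The 2-D vectors are toy placeholders where a real system would use learned conversation embeddings, and MMD is offered here as one representative distance, not necessarily the pair the paper evaluates.

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between two
    samples -- one example of a distribution-wise distance."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2 * kxy

real = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05)]   # "real" conversation features
near = [(0.0, 0.0), (0.1, 0.1), (0.05, 0.0)]    # system close to real
far = [(2.0, 2.0), (2.1, 1.9), (1.9, 2.1)]      # system far from real
```

A system whose generated conversations are distributed like the real ones gets a small MMD² (`mmd2(real, near)`), while a divergent one gets a large value (`mmd2(real, far)`).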

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

This work investigates evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available and shows that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain.
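The weakness of word-overlap metrics for dialog is easy to demonstrate. The sketch below uses unigram F1 (a simple member of the overlap family the study examines); the example sentences are illustrative, not from the paper's data.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Word-overlap score (unigram F1), representative of the metric
    family found to correlate weakly with human judgment for dialog."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

reference = "i am doing great thanks for asking"
# A perfectly valid reply that shares no words with the reference
# scores 0.0 -- the core failure mode for open-ended dialog.
valid_but_different = "pretty good how about you"
```

Because dialog admits many acceptable responses with disjoint wording, overlap with a single reference says little about quality, matching the weak correlations the study reports.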

Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

This work explores using contextualized word embeddings to compute more accurate relatedness scores and thus better evaluation metrics; experiments show that these evaluation metrics outperform RUBER, which is trained on static embeddings.

Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems

The proposed metric, USL-H, which stands for Understandability, Sensibleness, and Likability in Hierarchy, achieves good correlations with human judgment and maintains its configurability towards different aspects and metrics.

RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, is proposed; it evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance).
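The blending idea can be sketched in a few lines. RUBER learns its unreferenced (query-reply) scorer with a neural network and explores several blending strategies (min, max, mean); here both components are crude bag-of-words cosines and the mean is used, purely for illustration.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as a token-count mapping."""
    return Counter(text.split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def ruber_style_score(query, reply, ground_truth):
    """Blend of a referenced score (reply vs. ground-truth reply) and an
    unreferenced score (reply vs. query). RUBER trains the unreferenced
    part; both are bag-of-words cosines here for simplicity."""
    referenced = cosine(bow(reply), bow(ground_truth))
    unreferenced = cosine(bow(reply), bow(query))
    return (referenced + unreferenced) / 2  # mean is one of RUBER's blends
```

A reply that is both relevant to the query and close to the ground truth ("yes coffee every day" for "do you like coffee") outscores an off-topic one, which is the ranking the blended routine is designed to produce.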

Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

It is demonstrated that human annotators have high agreement on assessing utterance-level engagement scores and that these scores can improve automatic evaluation metrics for open-domain dialogue systems, as shown by correlation with human judgements.