Towards Best Experiment Design for Evaluating Dialogue System Output

Sashank Santhanam and Samira Shaikh. International Conference on Natural Language Generation.

To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been demonstrated that human judgments can suffer from the inconsistency of ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of human judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human… 


Achieving Reliable Human Assessment of Open-Domain Dialogue Systems

Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out

How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation

This paper conducts two evaluation experiments on two aspects of summaries’ linguistic quality to compare Likert-type and ranking annotations, and shows that the best choice of evaluation method can vary from one aspect to another.

Do You Ever Get Off Track in a Conversation? The Conversational System’s Anatomy and Evaluation Metrics

The objective of this study is to investigate conversational agents, their design approaches and evaluation metrics, which can help to better understand the overall process of dialog system development, and future possibilities to enhance user experience.

Dynamic Human Evaluation for Relative Model Comparisons

This work proposes an agent-based framework of human evaluation to assess multiple labelling strategies and methods for deciding the better model, in a simulation and a crowdsourcing case study. Results indicate that a decision about the superior model can be made with high probability across different labelling strategies.

All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text

The role untrained human evaluations play in NLG evaluation is examined and three approaches for quickly training evaluators to better identify GPT3-authored text are explored and it is found that while evaluation accuracy improved up to 55%, it did not significantly improve across the three domains.

Local Knowledge Powered Conversational Agents

This work proposes a dialog framework that incorporates both local knowledge as well as users' past dialogues to generate high quality conversations and demonstrates that incorporating local knowledge can largely improve informativeness, coherency and realisticness measures using human evaluations.

Understanding Human Potentials for Evaluating Generative Models

Focusing on natural language generation, a method to dynamically measure the required human annotations when evaluating models in a relative comparison setting is proposed, ensuring sufficient labelling to reach a confident decision on the optimal model with high probability when comparing two generative models.

Learning to Plan and Realize Separately for Open-Ended Dialogue Systems

Through rigorous evaluations, both automated and human, it is demonstrated that decoupling the process into planning and realization performs better than an end-to-end approach.

TellMeWhy: A Dataset for Answering Why-Questions in Narratives

This work introduces TellMeWhy, a new crowd-sourced dataset that consists of more than 30k questions and free-form answers concerning why characters in short narratives perform the actions described, and shows that state-of-the-art models are far below human performance on answering such questions.

Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries

This work crowdsources evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling, and presents a scoring algorithm for Best-Worst Scaling that is called value learning.



Evaluating Coherence in Dialogue Systems using Entailment

Results show that interpretable metrics for evaluating topic coherence by making use of distributed sentence representations can be used as a surrogate for human judgment, making it easy to evaluate dialogue systems on large-scale datasets and allowing an unbiased estimate for the quality of the responses.

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

An evaluation model that learns to predict human-like scores for input responses, using a new dataset of human response scores, is presented. It is shown that the ADEM model’s predictions correlate significantly with human judgements at both the utterance and system level, and at a level much higher than word-overlap metrics such as BLEU.

On Evaluating and Comparing Open Domain Dialog Systems

This paper proposes a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics which correlate well with human judgement, and believes that this work is a step towards an automatic evaluation process for conversational AIs.

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

This work investigates evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available and shows that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain.

On Evaluating and Comparing Conversational Agents

This paper proposes a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics which correlate well with human judgement, and believes that this work is a step towards an automatic evaluation process for conversational AIs.

A Survey of Natural Language Generation Techniques with a Focus on Dialogue Systems - Past, Present and Future Directions

This work provides a comprehensive review of approaches to building open-domain dialogue systems, an important application of natural language generation, and finds that these approaches predominantly use seq2seq or language model architectures.

RankME: Reliable Human Ratings for Natural Language Generation

This work presents a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments, and shows that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods.

Comparing Rating Scales and Preference Judgements in Language Evaluation

This paper presents three pairs of evaluation experiments assessing text fluency and clarity for different data sets, and finds that the preference-judgement experiment (PJE) versions have better evaluator self-consistency and inter-evaluator agreement, and a larger proportion of variation accounted for by system differences, resulting in a larger number of significant differences.

Deep Reinforcement Learning for Dialogue Generation

This work simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (non-repetitive turns), coherence, and ease of answering.

Personalizing Dialogue Agents: I have a dog, do you have pets too?

This work collects data and trains models to condition on their given profile information and on information about the person they are talking to, resulting in improved dialogues, as measured by next utterance prediction.