PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems

@article{Lan2020PONEAN,
  title={PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems},
  author={Tian Lan and Xianling Mao and Wei Wei and Xiaoyan Gao and Heyan Huang},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.02399}
}
Open-domain generative dialogue systems have attracted considerable attention over the past few years. Currently, how to evaluate them automatically is still a challenging problem. As far as we know, there are three kinds of automatic methods for evaluating open-domain generative dialogue systems: (1) word-overlap-based metrics; (2) embedding-based metrics; (3) learning-based metrics. Due to the lack of systematic comparison, it is not clear which kind of metric is more effective. In this…
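The three metric families named in the abstract can be illustrated with a minimal sketch: a word-overlap score via unigram precision, an embedding-based score via cosine similarity of averaged word vectors, and a placeholder for a learning-based scorer. The function names and the toy embedding table below are hypothetical and are not part of PONE itself.

```python
# Minimal sketch of the three metric families (illustrative only, not the PONE method).
import math
from collections import Counter

def word_overlap_score(reference: str, candidate: str) -> float:
    """(1) Word-overlap-based: unigram precision, in the spirit of BLEU-1."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    return sum((Counter(ref) & Counter(cand)).values()) / len(cand)

def embedding_score(reference: str, candidate: str, emb: dict) -> float:
    """(2) Embedding-based: cosine similarity of averaged word embeddings."""
    def avg(tokens):
        vecs = [emb[t] for t in tokens if t in emb]
        if not vecs:
            return None
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]
    a, b = avg(reference.split()), avg(candidate.split())
    if a is None or b is None:
        return 0.0
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def learned_score(context: str, reply: str) -> float:
    """(3) Learning-based: a trained model (e.g. ADEM- or RUBER-style) would map
    (context, reply) to a quality score; a constant stands in here."""
    return 0.5

# Toy usage with a made-up two-dimensional embedding table.
emb = {"hello": [1.0, 0.0], "hi": [0.9, 0.1], "there": [0.2, 0.8]}
print(word_overlap_score("hello there", "hi there"))
print(embedding_score("hello there", "hi there", emb))
print(learned_score("how are you ?", "hi there"))
```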
Enhancing the Open-Domain Dialogue Evaluation in Latent Space
  • Zhangming Chan, Lemao Liu, +4 authors Rui Yan
  • Computer Science
  • FINDINGS
  • 2021
TLDR
Experimental results on two real-world dialogue datasets confirm the superiority of the self-supervised method for open-domain dialogue evaluation, where both Pearson and Spearman correlations with human judgments outperform all baselines.
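Pearson and Spearman correlation with human judgments, the evaluation protocol mentioned in this and several of the entries below, is typically computed as in the short sketch here; the score arrays are made-up examples and scipy is assumed to be available.

```python
# Sketch: correlating an automatic metric's scores with human ratings (toy numbers).
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.12, 0.54, 0.33, 0.91, 0.47]  # hypothetical metric outputs per response
human_scores = [1, 4, 2, 5, 3]                  # hypothetical human ratings of the same responses

pearson, p_pearson = pearsonr(metric_scores, human_scores)
spearman, p_spearman = spearmanr(metric_scores, human_scores)
print(f"Pearson {pearson:.3f} (p={p_pearson:.3f}), Spearman {spearman:.3f} (p={p_spearman:.3f})")
```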
Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems
TLDR
The proposed metric, USL-H, which stands for Understandability, Sensibleness, and Likability in Hierarchy, achieves good correlations with human judgment and maintains its configurability towards different aspects and metrics.
POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling
  • Zeyang Liu, Ke Zhou, Jiaxin Mao, Max L. Wilson
  • Computer Science
  • ArXiv
  • 2021
Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm where users are allowed, via natural language dialogues, to communicate with search…
DynaEval: Unifying Turn and Dialogue Level Evaluation
TLDR
DynaEval is proposed, a unified automatic evaluation framework which is not only capable of performing turn-level evaluation, but also holistically considers the quality of the entire dialogue.
Meta-evaluation of Conversational Search Evaluation Metrics
TLDR
This work systematically meta-evaluates a variety of conversational search metrics, establishes METEOR as the best existing single-turn metric considering all three perspectives, and demonstrates that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction.
A Survey of Dialogue System Evaluation
  • Yifan Fan, Xudong Luo
  • Computer Science
  • 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI)
  • 2020
TLDR
Some essential criteria and widely used methods for evaluating dialogue systems are surveyed, focusing on the latest research progress on this topic, and both machine-learning-based and deep-learning-based evaluation methods are discussed.
Context-Controlled Topic-Aware Neural Response Generation for Open-Domain Dialog Systems
TLDR
A Context-Controlled Topic-Aware neural response generation model, CCTA, is proposed, which lets the dialog context interact with topic representation and transition to achieve balanced improvements in response informativeness and contextual coherence; topic transition modeling is also found to work as an auxiliary learning task that boosts response generation.
Task Intelligence for Search and Recommendation
  • C. Shah, Ryen W. White
  • Computer Science
  • 2021
TLDR
This work indicates that information access problems involving day-to-day task completion remain challenging in search and recommendation, and that opportunities to address them are still open.
A Comprehensive Assessment of Dialog Evaluation Metrics
TLDR
A comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets is provided, which suggests how to best assess evaluation metrics and indicates promising directions for future work.
Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach
TLDR
A new framework named ENIGMA is proposed for estimating human evaluation scores based on recent advances in off-policy evaluation in reinforcement learning, which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors.

References

Showing 1-10 of 55 references
Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings
TLDR
Using contextualized word embeddings to compute more accurate relatedness scores and thus better evaluation metrics is explored, and experiments show that the resulting evaluation metrics outperform RUBER, which is trained on static embeddings.
RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems
TLDR
RUBER, a Referenced metric and Unreferenced metric Blended Evaluation Routine, is proposed, which evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance) and has a high correlation with human annotation.
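As a rough sketch of RUBER's blending idea, the referenced score (reply vs. ground truth) and the unreferenced score (reply vs. query) can be combined with a min, max, or mean; the two component scorers below are simple stand-ins for illustration, not RUBER's actual models.

```python
# Sketch of RUBER-style blending of a referenced and an unreferenced score.
# Both component scorers below are placeholders, not RUBER's actual models.
import math

def referenced_score(ground_truth: str, reply: str) -> float:
    # Placeholder: token Jaccard similarity stands in for the embedding-based
    # similarity between the reply and the ground-truth reply.
    a, b = set(ground_truth.split()), set(reply.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def unreferenced_score(query: str, reply: str) -> float:
    # Placeholder: a real unreferenced scorer is a neural network trained to judge
    # how well the reply fits the query; a constant stands in here.
    return 0.5

def ruber_score(query: str, ground_truth: str, reply: str, strategy: str = "geometric") -> float:
    r, u = referenced_score(ground_truth, reply), unreferenced_score(query, reply)
    if strategy == "min":
        return min(r, u)
    if strategy == "max":
        return max(r, u)
    if strategy == "arithmetic":
        return (r + u) / 2
    return math.sqrt(r * u)  # geometric mean of the two scores

print(ruber_score("how was your day ?", "it was great", "it was great thanks"))
```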
Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
TLDR
An evaluation model (ADEM) that learns to predict human-like scores for input responses is presented, using a new dataset of human response scores, and it is shown that ADEM's predictions correlate significantly with human judgements at both the utterance and system level, at a level much higher than word-overlap metrics such as BLEU.
One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning
TLDR
Experiments show that the adversarial multi-task neural metric (ADVMT) achieves a high correlation with human annotation, which yields better performance than monolingual ones and various existing metrics.
How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
TLDR
This work investigates evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available and shows that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain.
Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems
TLDR
It is demonstrated that human annotators have high agreement on assessing utterance-level engagement scores and that these scores can improve automatic evaluation metrics for open-domain dialogue systems, as shown by correlation with human judgements.
Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
TLDR
The recently proposed hierarchical recurrent encoder-decoder neural network is extended to the dialogue domain, and it is demonstrated that this model is competitive with state-of-the-art neural language models and back-off n-gram models.
Sequence-to-Sequence Data Augmentation for Dialogue Language Understanding
TLDR
A sequence-to-sequence generation-based data augmentation framework is proposed that leverages an utterance's semantically equivalent alternatives in the training data to produce diverse utterances that help improve the language understanding module.
Towards Implicit Content-Introducing for Generative Short-Text Conversation Systems
TLDR
This paper proposes an implicit content-introducing method which incorporates additional information into the Seq2Seq model in a flexible way and fuses the general decoding with the auxiliary cue-word information through the proposed hierarchical gated fusion unit.
Towards a Human-like Open-Domain Chatbot
TLDR
Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations, is presented, and a human evaluation metric called Sensibleness and Specificity Average (SSA) is proposed, which captures key elements of a human-like multi-turn conversation.
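The SSA metric described in the entry above averages the per-response sensibleness and specificity rates; a minimal sketch with made-up binary labels is shown below.

```python
# Sketch of the Sensibleness and Specificity Average (SSA) with made-up labels.
sensible = [1, 1, 0, 1, 1]  # hypothetical binary sensibleness labels per response
specific = [1, 0, 0, 1, 1]  # hypothetical binary specificity labels per response

ssa = (sum(sensible) / len(sensible) + sum(specific) / len(specific)) / 2
print(f"SSA = {ssa:.2f}")  # 0.70 for these toy labels
```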