LEGOEval: An Open-Source Toolkit for Dialogue System Evaluation via Crowdsourcing

Yu Li, Josh Arnold, Feifan Yan, Weiyan Shi, Zhou Yu
We present LEGOEval, an open-source toolkit that enables researchers to evaluate dialogue systems in a few lines of code using the online crowdsourcing platform Amazon Mechanical Turk. Compared to existing toolkits, LEGOEval features a flexible task design by providing a Python API that maps to commonly used React.js interface components. Researchers can easily personalize their evaluation procedures with our built-in pages, as if playing with LEGO blocks. Thus, LEGOEval provides a fast…
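The abstract describes composing an evaluation task from reusable page components, LEGO-style. The sketch below illustrates that general design pattern in plain Python; the class names (`Page`, `Task`) and fields are illustrative assumptions for exposition, not LEGOEval's actual API.

```python
# Illustrative sketch (NOT LEGOEval's real API): composing a crowdsourcing
# task from reusable page "blocks" that snap together in sequence.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Page:
    """One screen shown to a crowd worker (hypothetical component)."""
    kind: str                               # e.g. "consent", "chat", "likert"
    props: dict = field(default_factory=dict)

@dataclass
class Task:
    """An ordered sequence of pages forming one evaluation task."""
    pages: List[Page] = field(default_factory=list)

    def add(self, page: Page) -> "Task":
        self.pages.append(page)
        return self                         # chaining lets blocks stack

# Build a task: consent form, then a chat session, then a rating page.
task = (
    Task()
    .add(Page("consent"))
    .add(Page("chat", {"bot_endpoint": "http://localhost:5000"}))
    .add(Page("likert", {"question": "How coherent was the bot?", "scale": 5}))
)
print([p.kind for p in task.pages])  # -> ['consent', 'chat', 'likert']
```

Each `Page` here would correspond to a React.js interface component on the worker-facing side, with the Python object supplying its props.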


ErAConD : Error Annotated Conversational Dialog Dataset for Grammatical Error Correction
This paper presents a novel parallel GEC dataset drawn from open-domain chatbot conversations; this dataset is, to the authors' knowledge, the first GEC dataset targeted at a conversational setting. The annotated data is used to fine-tune a state-of-the-art GEC model, yielding a 16-point increase in model precision.
The R-U-A-Robot Dataset: Helping Avoid Chatbot Deception by Detecting User Questions About Human or Non-Human Identity
This work explores how both a generative research model (Blender) as well as two deployed systems (Amazon Alexa, Google Assistant) handle this intent, finding that systems often fail to confirm their nonhuman identity.
Quality Assessment Methods for Textual Conversational Interfaces: A Multivocal Literature Review
A systematic Multivocal Literature Review (MLR) is conducted across five different literature sources to survey the quality attributes, evaluation frameworks, and evaluation datasets proposed in the field, as an aid to researchers and practitioners.


ChatEval: A Tool for Chatbot Evaluation
A unified framework for human evaluation of chatbots is introduced that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems, along with open-source baseline models and evaluation datasets.
ParlAI: A Dialog Research Software Platform
ParlAI (pronounced "par-lay"), an open-source software platform for dialog research implemented in Python, is introduced to provide a unified framework for sharing, training, and testing dialog models; it integrates Amazon Mechanical Turk for data collection, human evaluation, and online/reinforcement learning.
DialCrowd: A toolkit for easy dialog system assessment
DialCrowd, a toolkit designed to make dialog system assessment easier and to ensure the quality of the results, is described, along with the specific needs it fulfills and how it works.
ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons
A novel procedure is proposed in which a human judge compares two full dialogues, attending to only one speaker within each, and makes a pairwise judgment, yielding more reliable evaluations.
Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
An evaluation model (ADEM) is proposed that learns to predict human-like scores for input responses, using a new dataset of human response scores. The ADEM model's predictions are shown to correlate with human judgements significantly more strongly than word-overlap metrics such as BLEU, at both the utterance and system level.
Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
The recently proposed hierarchical recurrent encoder-decoder neural network is extended to the dialogue domain, and it is demonstrated that this model is competitive with state-of-the-art neural language models and back-off n-gram models.
Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems, identifying their shortcomings while accumulating evidence towards the most effective evaluation dimensions.
How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
This work investigates evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available and shows that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain.
Survey on evaluation methods for dialogue systems
This paper distinguishes between the various classes of dialogue systems (task-oriented, conversational, and question-answering) by introducing the main technologies developed for each class and then presenting the evaluation methods pertaining to that class.
Towards a Human-like Open-Domain Chatbot
Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public-domain social media conversations, is presented, and a human evaluation metric called Sensibleness and Specificity Average (SSA) is proposed, which captures key elements of a human-like multi-turn conversation.