POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling

  title={POSSCORE: A Simple Yet Effective Evaluation of Conversational Search with Part of Speech Labelling},
  author={Zeyang Liu and Ke Zhou and Jiaxin Mao and Max L. Wilson},
  journal={Proceedings of the 30th ACM International Conference on Information \& Knowledge Management},
  • Published 7 September 2021
Conversational search systems, such as Google Assistant and Microsoft Cortana, provide a new search paradigm in which users communicate with the search system through natural language dialogues. Evaluating such systems is very challenging because search results are presented as natural language sentences. Given the unlimited number of possible responses, collecting relevance assessments for every possible response is infeasible. In this paper, we propose POSSCORE, a simple…

SCAI-QReCC Shared Task on Conversational Question Answering

This report discusses each subtask, but emphasizes the answer generation subtask, as it attracted the most attention from participants; it identifies the evaluation of answer correctness in conversational settings as a major challenge and a current research gap.

Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges

The workshop explored the current state of the art along with its limitations and suggested promising directions for future work in this important and very rapidly changing area of research.

Meta-evaluation of Conversational Search Evaluation Metrics

This work establishes the most comprehensive meta-evaluation of conversational search metrics to date, considering all three perspectives, and shows that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction.

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

An evaluation model (ADEM) that learns to predict human-like scores for input responses, trained on a new dataset of human response scores; the ADEM model's predictions correlate significantly with human judgements at both the utterance and system level, and at a much higher level than word-overlap metrics such as BLEU.

RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

RUBER is proposed, a Referenced metric and Unreferenced metric Blended Evaluation Routine, which evaluates a reply by taking into consideration both a ground-truth reply and a query (the previous user-issued utterance).
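The blending idea behind RUBER can be sketched briefly. In this hypothetical sketch, the referenced score is a cosine similarity between pooled sentence embeddings, the unreferenced score is assumed to come from some trained neural scorer (not implemented here), and the two are combined with simple heuristics; the function names and pooling choice are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def referenced_score(reply_vec, reference_vec):
    """Cosine similarity between pooled embeddings of the generated
    reply and the ground-truth reply (illustrative referenced metric)."""
    return float(np.dot(reply_vec, reference_vec) /
                 (np.linalg.norm(reply_vec) * np.linalg.norm(reference_vec)))

def ruber_blend(referenced, unreferenced, strategy="min"):
    """Combine the referenced and unreferenced scores with a simple
    heuristic (min, max, arithmetic mean, or geometric mean)."""
    ops = {"min": min,
           "max": max,
           "mean": lambda a, b: (a + b) / 2,
           "geometric": lambda a, b: (a * b) ** 0.5}
    return ops[strategy](referenced, unreferenced)
```

For example, with a referenced score of 0.4 and an unreferenced score of 0.8, the "min" strategy yields 0.4 and the "mean" strategy yields 0.6.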

How Am I Doing?: Evaluating Conversational Search Systems Offline

This work proposes a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model.

Wizard of Wikipedia: Knowledge-Powered Conversational Agents

The best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while a new benchmark allows for measuring further improvements in this important research direction.

Learning an Unreferenced Metric for Online Dialogue Evaluation

This work proposes an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances and leverages the temporal transitions between them; the model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.

Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

This work explores using contextualized word embeddings to compute more accurate relatedness scores and thus better evaluation metrics; experiments show that the resulting metrics outperform RUBER, which is trained on static embeddings.

uBLEU: Uncertainty-Aware Automatic Evaluation Method for Open-Domain Dialogue Systems

A fully automatic, uncertainty-aware evaluation method for open-domain dialogue systems, uBLEU, which first collects diverse reference responses from massive dialogue data and then annotates their quality by using a neural network trained on automatically collected training data.

Bootstrapping Dialog Systems with Word Embeddings

This work investigates the use of word embeddings in a text classification task with little training data and proposes a simple alternative, vector extrema, to replace the usual averaging of a sentence’s vectors.
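Vector extrema, mentioned above as an alternative to averaging word vectors, can be sketched in a few lines: for each embedding dimension, keep the value with the largest magnitude (preserving its sign) across all word vectors in the sentence. This is a minimal NumPy sketch of the general technique, not the paper's exact implementation.

```python
import numpy as np

def vector_extrema(word_vectors):
    """Pool a sentence's word vectors by taking, per dimension,
    the value with the largest absolute magnitude (sign preserved)."""
    vs = np.asarray(word_vectors)          # shape: (num_words, dim)
    idx = np.argmax(np.abs(vs), axis=0)    # row index of the extreme value per dim
    return vs[idx, np.arange(vs.shape[1])]

# e.g. vector_extrema([[1.0, -3.0], [2.0, 1.0]]) -> [2.0, -3.0]
```

The intuition is that extreme values carry more discriminative information than the mean, which common words pull toward zero.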

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

This work investigates evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available and shows that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain.