Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, Omri Abend
Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the source text they rely on. As a consequence, such models are unreliable, limiting their real-world applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization (Durmus et al., 2020; Wang et al., 2020), we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models using automatic question… 
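The QG/QA idea can be sketched as follows: questions are generated from the model's response, each question is answered against the grounding knowledge, and the two answers are compared. This is a minimal illustrative sketch only; `answer_from_knowledge` and `consistency_score` are hypothetical stand-ins for the neural QG and QA components used in practice, and only the answer-comparison step is shown.

```python
def token_f1(pred, gold):
    """Token-level F1 between two answer spans (as in SQuAD-style evaluation)."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision = common / len(p)
    recall = common / len(g)
    return 2 * precision * recall / (precision + recall)

def consistency_score(qa_pairs, answer_from_knowledge):
    """Average answer overlap over (question, answer-from-response) pairs.

    Each question was generated from the dialogue response; it is re-answered
    against the grounding knowledge, and the two answers are compared.
    """
    scores = []
    for question, answer_from_response in qa_pairs:
        knowledge_answer = answer_from_knowledge(question)
        scores.append(token_f1(answer_from_response, knowledge_answer))
    return sum(scores) / len(scores) if scores else 0.0
```

A response whose answers match the knowledge scores near 1, while a hallucinated answer scores near 0.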
LaMDA: Language Models for Dialog Applications
It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.
ComSum: Commit Messages Summarization and Meaning Preservation
ComSum, a dataset of 7 million commit messages for text summarization, is presented, and it is proposed to evaluate outputs not only by Rouge but also by their meaning preservation.
DialFact: A Benchmark for Fact-Checking in Dialogue
DIALFACT, a testing benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia, is constructed and a simple yet data-efficient solution to effectively improve fact-checking performance in dialogue is proposed.
Rome was built in 1776: A Case Study on Factual Correctness in Knowledge-Grounded Response Generation
This paper presents a human annotation setup to identify three different response types: responses that are factually consistent with respect to the input knowledge, responses that contain hallucinated knowledge, and non-verifiable chitchat style responses.


The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
dodecaDialogue is introduced: a set of 12 tasks that measure whether a conversational agent can communicate engagingly with personality and empathy; multi-tasking is shown to provide gains on both text- and image-based tasks across several metrics, in both fine-tuning and task-transfer settings.
Bleu: a Method for Automatic Evaluation of Machine Translation
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
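BLEU's core idea, clipped n-gram precision combined with a brevity penalty, can be sketched in a few lines. This is a simplified single-reference version with uniform weights and no smoothing, meant only to illustrate the computation; production use should rely on an established implementation such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipping: a candidate n-gram is credited at most as often as it
        # appears in the reference, penalizing repeated words.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical candidate and reference score 1.0; a candidate that only repeats a single reference word is driven to 0.0 by clipping and the higher-order n-gram precisions.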
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
QAGS (pronounced “kags”), an automatic evaluation protocol that is designed to identify factual inconsistencies in a generated summary, is proposed and is believed to be a promising tool in automatically generating usable and factually consistent text.
Dialogue Natural Language Inference
This paper proposes a method demonstrating that a model trained on Dialogue NLI can be used to improve the consistency of a dialogue model, and evaluates the method with human evaluation and with automatic metrics on a suite of evaluation sets designed to measure a dialogue model's consistency.
Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations
Topical-Chat is introduced, a knowledge-grounded human-human conversation dataset in which the underlying knowledge spans 8 broad topics and conversation partners do not have explicitly defined roles, to help further research in open-domain conversational AI.
Wizard of Wikipedia: Knowledge-Powered Conversational agents
The best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while a new benchmark allows for measuring further improvements in this important research direction.
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization
An automatic question answering (QA) based metric for faithfulness, FEQA, is proposed, which leverages recent advances in reading comprehension and has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained and, with more careful pretraining, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Evaluating Groundedness in Dialogue Systems: The BEGIN Benchmark
The Benchmark for Evaluation of Grounded INteraction (BEGIN) consists of 8113 dialogue turns generated by language-model-based dialogue systems, accompanied by human annotations specifying the relationship between the system's response and the background information.
Improving Factual Consistency of Abstractive Summarization via Question Answering
This paper presents an approach to address factual consistency in summarization, and proposes an efficient automatic evaluation metric to measure factual consistency and a novel learning algorithm that maximizes the proposed metric during model training.