Corpus ID: 233289483

Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

@inproceedings{Honovich2021Q2EF,
  title={Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering},
  author={Or Honovich and Leshem Choshen and Roee Aharoni and Ella Neeman and Idan Szpektor and Omri Abend},
  booktitle={EMNLP},
  year={2021}
}
Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the source text they rely on. As a consequence, such models are unreliable, limiting their real-world applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization (Durmus et al., 2020; Wang et al., 2020), we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models using automatic question generation and question answering.
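To make the proposed question generation/question answering pipeline concrete, below is a minimal sketch, assuming off-the-shelf Hugging Face checkpoints. The model names ("valhalla/t5-base-e2e-qg", "deepset/roberta-base-squad2"), the "generate questions:" prompt format, and the "<sep>" question separator are illustrative assumptions, and a token-level F1 comparison stands in for the NLI-based answer comparison used in the paper.

# Illustrative sketch of a QG/QA-based factual-consistency check, not the
# authors' released implementation: generate questions from the system
# response, answer each against both the response and the grounding
# knowledge, and compare the two answers.
from collections import Counter

from transformers import pipeline

# Assumed off-the-shelf checkpoints; any seq2seq QG model and extractive QA model would do.
qg = pipeline("text2text-generation", model="valhalla/t5-base-e2e-qg")
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")


def token_f1(a: str, b: str) -> float:
    """SQuAD-style token-overlap F1 between two answer strings."""
    a_tok, b_tok = a.lower().split(), b.lower().split()
    common = Counter(a_tok) & Counter(b_tok)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a_tok), overlap / len(b_tok)
    return 2 * precision * recall / (precision + recall)


def consistency_score(response: str, knowledge: str) -> float:
    """Average agreement between answers grounded in the response vs. the knowledge."""
    # Prompt prefix and "<sep>" output separator follow the assumed QG model's convention.
    generated = qg("generate questions: " + response)[0]["generated_text"]
    questions = [q.strip() for q in generated.split("<sep>") if q.strip()]
    scores = []
    for question in questions:
        ans_from_response = qa(question=question, context=response)["answer"]
        ans_from_knowledge = qa(question=question, context=knowledge)["answer"]
        scores.append(token_f1(ans_from_response, ans_from_knowledge))
    return sum(scores) / len(scores) if scores else 0.0

A response that faithfully reflects the grounding knowledge yields matching answers and a score near 1, while hallucinated content produces answers the knowledge cannot support.
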
Citations

ComSum: Commit Messages Summarization and Meaning Preservation
TLDR: ComSum, a data set of 7 million commit messages for text summarization, is presented, and it is proposed to evaluate outputs not only by ROUGE but also by their meaning preservation.
DialFact: A Benchmark for Fact-Checking in Dialogue
TLDR: DIALFACT, a benchmark dataset of 22,245 annotated conversational claims paired with evidence from Wikipedia, is constructed, and a simple yet data-efficient solution that effectively improves fact-checking performance in dialogue is proposed.
Rome was built in 1776: A Case Study on Factual Correctness in Knowledge-Grounded Response Generation
TLDR: This paper presents a human annotation setup to identify three different response types: responses that are factually consistent with the input knowledge, responses that contain hallucinated knowledge, and non-verifiable chitchat-style responses.

References

Showing 1-10 of 54 references
The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
TLDR: dodecaDialogue, a set of 12 tasks measuring whether a conversational agent can communicate engagingly with personality and empathy, is introduced; multi-tasking across the tasks provides gains to both text- and image-based tasks on several metrics, in both the fine-tuning and task-transfer settings.
Bleu: a Method for Automatic Evaluation of Machine Translation
TLDR: This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, correlates highly with human evaluation, and has little marginal cost per run.
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
TLDR: QAGS (pronounced "kags"), an automatic evaluation protocol designed to identify factual inconsistencies in a generated summary, is proposed and is believed to be a promising tool for automatically generating usable and factually consistent text.
Dialogue Natural Language Inference
TLDR: This paper demonstrates that a model trained on Dialogue NLI can be used to improve the consistency of a dialogue model, and evaluates the method with human evaluation and with automatic metrics on a suite of evaluation sets designed to measure a dialogue model's consistency.
Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations
TLDR: Topical-Chat, a knowledge-grounded human-human conversation dataset in which the underlying knowledge spans 8 broad topics and conversation partners do not have explicitly defined roles, is introduced to help further research in open-domain conversational AI.
Wizard of Wikipedia: Knowledge-Powered Conversational Agents
TLDR: The best performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while a new benchmark allows for measuring further improvements in this important research direction.
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization
TLDR: An automatic question answering (QA) based metric for faithfulness, FEQA, is proposed; it leverages recent advances in reading comprehension and has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR: It is found that BERT was significantly undertrained and, when properly trained, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Evaluating Groundedness in Dialogue Systems: The BEGIN Benchmark
TLDR: The Benchmark for Evaluation of Grounded INteraction (BEGIN) consists of 8113 dialogue turns generated by language-model-based dialogue systems, accompanied by human annotations specifying the relationship between the system's response and the background information.
Improving Factual Consistency of Abstractive Summarization via Question Answering
TLDR: This paper addresses factual consistency in summarization, proposing an efficient automatic evaluation metric to measure factual consistency and a novel learning algorithm that maximizes the proposed metric during model training.