Corpus ID: 233481654

Evaluating Groundedness in Dialogue Systems: The BEGIN Benchmark

Nouha Dziri, Hannah Rashkin, Tal Linzen, D. Reitter
Knowledge-grounded dialogue agents are systems designed to conduct a conversation based on externally provided background information, such as a Wikipedia page. Such dialogue agents, especially those based on neural network language models, often produce responses that sound fluent but are not justified by the background information. Progress towards addressing this problem requires developing automatic evaluation metrics that can quantify the extent to which responses are grounded in…
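As a rough illustration of the problem setting (this is a naive lexical-overlap baseline, not the BEGIN benchmark's annotation scheme or metric), one crude way to approximate groundedness is to measure what fraction of a response's tokens appear in the background evidence; a minimal Python sketch:

```python
import re


def token_precision(response: str, evidence: str) -> float:
    """Fraction of response tokens that also appear in the evidence.

    A crude lexical-overlap proxy for groundedness: 1.0 means every
    response token occurs in the evidence, while low values suggest the
    response introduces unsupported content. Purely illustrative; real
    groundedness evaluation needs semantic judgments, not word overlap.
    """
    tokenize = lambda s: re.findall(r"[a-z0-9']+", s.lower())
    resp_tokens = tokenize(response)
    evid_tokens = set(tokenize(evidence))
    if not resp_tokens:
        return 0.0
    return sum(t in evid_tokens for t in resp_tokens) / len(resp_tokens)


evidence = "The Eiffel Tower is a wrought-iron lattice tower in Paris, France."
grounded = "The Eiffel Tower is a lattice tower located in Paris."
ungrounded = "The Eiffel Tower was built in 1920 by Roman engineers."

print(token_precision(grounded, evidence))    # high overlap with evidence
print(token_precision(ungrounded, evidence))  # much lower overlap
```

Such surface-overlap heuristics fail on paraphrases and on fluent hallucinations that reuse evidence vocabulary, which is exactly why learned metrics of the kind surveyed below are needed.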


DialFact: A Benchmark for Fact-Checking in Dialogue
DIALFACT, a benchmark dataset of 22,245 annotated conversational claims paired with pieces of evidence from Wikipedia, is constructed, and a simple yet data-efficient method to improve fact-checking performance in dialogue is proposed.
Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding
This paper proposes NEURAL PATH HUNTER, which follows a generate-then-refine strategy whereby a generated response is amended using the k-hop subgraph of a Knowledge Graph (KG), and reports a relative improvement in faithfulness over GPT-2 dialogue responses of 8.4%.
Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
This work proposes an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models using automatic question generation and question answering, and makes use of co-reference resolution and natural language inference capabilities, which greatly improve its performance.


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Dialogue Natural Language Inference
This paper proposes a method demonstrating that a model trained on Dialogue NLI can be used to improve the consistency of a dialogue model, and evaluates the method with human evaluation and with automatic metrics on a suite of evaluation sets designed to measure a dialogue model's consistency.
Wizard of Wikipedia: Knowledge-Powered Conversational Agents
The best-performing dialogue models are able to conduct knowledgeable discussions on open-domain topics as evaluated by automatic metrics and human evaluations, while a new benchmark allows for measuring further improvements in this important research direction.
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.
Personalizing Dialogue Agents: I have a dog, do you have pets too?
This work collects data and trains models to condition on their given profile information and on information about the person they are talking to, resulting in improved dialogues, as measured by next-utterance prediction.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Recipes for Building an Open-Domain Chatbot
Human evaluations show the best models outperform existing approaches in multi-turn dialogue on engagingness and humanness measurements, and the limitations of this work are discussed by analyzing failure cases of the models.
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
QAGS (pronounced “kags”), an automatic evaluation protocol that is designed to identify factual inconsistencies in a generated summary, is proposed and is believed to be a promising tool in automatically generating usable and factually consistent text.
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
It is shown that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems.
Evaluating the Factual Consistency of Abstractive Text Summarization
A weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary substantially outperforms previous models, including those trained with strong supervision using standard datasets for natural language inference and fact checking.