Corpus ID: 245502761

Measuring Attribution in Natural Language Generation Models

@article{Rashkin2021MeasuringAI,
  title={Measuring Attribution in Natural Language Generation Models},
  author={Hannah Rashkin and Vitaly Nikolaev and Matthew Lamm and Michael Collins and Dipanjan Das and Slav Petrov and Gaurav Singh Tomar and Iulia Turc and D. Reitter},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.12870}
}
With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a… 
LaMDA: Language Models for Dialog Applications
TLDR: It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements on the two key challenges of safety and factual grounding.
TRUE: Re-evaluating Factual Consistency Evaluation
TLDR: This work introduces TRUE, a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency, and finds that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
TLDR: This paper surveys the issues with human and automatic model evaluations and with commonly used NLG datasets that have been pointed out over the past 20 years, lays out a long-term vision for NLG evaluation, and proposes concrete steps for researchers to improve their evaluation processes.
FaithDial: A Faithful Benchmark for Information-Seeking Dialogue
TLDR: FaithDial is created and can serve as a training signal for a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts performance by 21.1 F1 score on the BEGIN benchmark compared to existing datasets for dialogue coherence.
On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?
TLDR: This work conducts a comprehensive human study on both existing knowledge-grounded conversational benchmarks and several state-of-the-art models, revealing that the standard benchmarks consist of more than 60% hallucinated responses, leading to models that not only hallucinate but even amplify hallucinations.