Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

Liyan Tang, Tanya Goyal, Alexander R. Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryscinski, Justin F. Rousseau, Greg Durrett

The propensity of abstractive summarization systems to make factual errors has been the subject of significant study, including work on models to detect factual errors and annotation of errors in current systems’ outputs. However, the ever-evolving nature of summarization systems, error detectors, and annotated benchmarks makes factuality evaluation a moving target; it is hard to get a clear picture of how techniques compare. In this work, we collect labeled factuality errors from across nine…

BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics

This work presents a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary.

mFACE: Multilingual Summarization with Factual Consistency Evaluation

This work exploits factual consistency evaluation models to improve multilingual summarization and explores two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation.

On Improving Summarization Factual Consistency from Natural Language Feedback

This work collects a high-quality dataset, DeFacto, containing human demonstrations and informational feedback in natural language consisting of corrective instructions, edited summaries, and explanations with respect to the factual consistency of the summary, and evaluates if models can automatically correct factual inconsistencies in generated summaries.

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

This work proposes a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. The results have important implications for evaluating large language models (LLMs), as they show that LLMs adjusted by human feedback may overfit unconstrained human evaluation.

Analyzing and Evaluating Faithfulness in Dialogue Summarization

This work first performs a fine-grained human analysis of the faithfulness of dialogue summaries and observes that over 35% of generated summaries are inconsistent with respect to the source dialogues, and then presents a new model-level faithfulness evaluation method.

News Summarization and Evaluation in the Era of GPT-3

It is shown that not only do humans overwhelmingly prefer GPT-3 summaries, but these summaries also do not suffer from common dataset-specific issues such as poor factuality, and that neither reference-based nor reference-free automatic metrics can reliably evaluate zero-shot summaries.



CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization

It is found that the contrastive learning framework consistently produces more factual summaries than strong comparisons with post error correction, entailment-based reranking, and unlikelihood training, according to QA-based factuality evaluation.

Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

A typology of factual errors is devised and used to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. The resulting benchmark is used to assess factuality metrics, showing their correlation with human judgement as well as their specific strengths and weaknesses.

Factual Error Correction for Abstractive Summarization Models

This work proposes a post-editing corrector module that is able to correct factual errors in summaries generated by other neural summarization models and outperforms previous models on factual consistency evaluation on the CNN/DailyMail dataset.

What Have We Achieved on Text Summarization?

It is found that under similar settings, extractive summarizers are in general better than their abstractive counterparts thanks to strength in faithfulness and factual consistency, and that pre-training techniques, in particular sequence-to-sequence pre-training, are highly effective for improving text summarization, with BART giving the best results.

Asking and Answering Questions to Evaluate the Factual Consistency of Summaries

QAGS (pronounced “kags”), an automatic evaluation protocol that is designed to identify factual inconsistencies in a generated summary, is proposed and is believed to be a promising tool in automatically generating usable and factually consistent text.

SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization

This work revisits the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level) and inconsistency detection (document-level).

Annotating and Modeling Fine-grained Factuality in Summarization

This work explores both synthetic and human-labeled data sources for training models to identify factual errors in summarization, and shows that the best factuality detection model enables training of more factual XSum summarization models by allowing us to identify non-factual tokens in the training data.

Evaluating the Factual Consistency of Abstractive Text Summarization

A weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary substantially outperforms previous models, including those trained with strong supervision using standard datasets for natural language inference and fact checking.

CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning

This work devises a typology of factual errors to better understand the types of hallucinations generated by current models, conducts human evaluation on a popular dialogue summarization dataset, and proposes a training strategy, called CONFIT, that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning.

Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization

This work proposes a novel detection approach that separates factual from non-factual hallucinations of entities and uses this method as a reward signal to train a summarization system using an off-line reinforcement learning (RL) algorithm that can significantly improve the factuality of generated summaries while maintaining the level of abstractiveness.