Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

Liyan Tang, Tanya Goyal, Alexander R. Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryscinski, Justin F. Rousseau, Greg Durrett
The propensity of abstractive summarization systems to make factual errors has been the subject of significant study, including work on models to detect factual errors and annotation of errors in current systems’ outputs. However, the ever-evolving nature of summarization systems, error detectors, and annotated benchmarks makes factuality evaluation a moving target; it is hard to get a clear picture of how techniques compare. In this work, we collect labeled factuality errors from across nine…


Analyzing and Evaluating Faithfulness in Dialogue Summarization

It is observed that over 35% of generated summaries are faithfully inconsistent with respect to the source dialogues, and a new model-level faithfulness evaluation method is presented, which examines generation models with multiple-choice questions created by rule-based transformations.

News Summarization and Evaluation in the Era of GPT-3

It is shown that not only do humans overwhelmingly prefer GPT-3 summaries, but these also do not suffer from common dataset-specific issues such as poor factuality, and both reference-based and reference-free automatic metrics cannot reliably evaluate zero-shot summaries.



CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization

It is found that the contrastive learning framework consistently produces more factual summaries than strong baselines based on post-hoc error correction, entailment-based reranking, and unlikelihood training, according to QA-based factuality evaluation.

Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

A typology of factual errors is devised and used to collect human annotations of summaries generated by state-of-the-art summarization systems on the CNN/DM and XSum datasets; factuality metrics are then benchmarked against these annotations, showing their correlation with human judgment as well as their specific strengths and weaknesses.

Factual Error Correction for Abstractive Summarization Models

This work proposes a post-editing corrector module that is able to correct factual errors in summaries generated by other neural summarization models and outperforms previous models on factual consistency evaluation on the CNN/DailyMail dataset.

What Have We Achieved on Text Summarization?

It is found that, under similar settings, extractive summarizers are in general better than their abstractive counterparts thanks to their strength in faithfulness and factual consistency, and that pre-training techniques, in particular sequence-to-sequence pre-training, are highly effective for improving text summarization, with BART giving the best results.

Asking and Answering Questions to Evaluate the Factual Consistency of Summaries

QAGS (pronounced “kags”), an automatic evaluation protocol designed to identify factual inconsistencies in a generated summary, is proposed and is believed to be a promising tool for automatically generating usable and factually consistent text.

SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization

This work revisits the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level).

Annotating and Modeling Fine-grained Factuality in Summarization

This work explores both synthetic and human-labeled data sources for training models to identify factual errors in summarization, and shows that the best factuality detection model enables training of more factual XSum summarization models by allowing us to identify non-factual tokens in the training data.

Evaluating the Factual Consistency of Abstractive Text Summarization

A weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary substantially outperforms previous models, including those trained with strong supervision using standard datasets for natural language inference and fact checking.

CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning

This work devises a typology of factual errors to better understand the types of hallucinations generated by current models, conducts a human evaluation on a popular dialogue summarization dataset, and proposes CONFIT, a training strategy that improves the factual consistency and overall quality of summaries via novel contrastive fine-tuning.

Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization

This work proposes a novel detection approach that separates factual from non-factual hallucinations of entities, and uses this method as a reward signal to train a summarization system with an offline reinforcement learning (RL) algorithm, significantly improving the factuality of generated summaries while maintaining the level of abstractiveness.