SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization

Jesse Vig, Wojciech Kryscinski, Karan Goel, Nazneen Rajani
Novel neural architectures, training strategies, and the availability of large-scale corpora have been the driving force behind recent progress in abstractive text summarization. However, due to the black-box nature of neural models, uninformative evaluation metrics, and scarce tooling for model and data analysis, the true performance and failure modes of summarization models remain largely unknown. To address this limitation, we introduce SummVis, an open-source tool for visualizing…


Summary Explorer: Visualizing the State of the Art in Text Summarization
This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems by compiling the outputs of 55 state-of-the-art single-document summarization models.
SummerTime: Text Summarization Toolkit for Non-experts
SummerTime is a complete toolkit for text summarization, covering models, datasets, and evaluation metrics for the full spectrum of summarization-related tasks; it integrates with libraries designed for NLP researchers and provides users with easy-to-use APIs.
Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents
Summ^N is the first multi-stage split-then-summarize framework for long input summarization and outperforms previous state-of-the-art methods by improving ROUGE scores on three long meeting summarization datasets AMI, ICSI, and QMSum.
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
This work revisits the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence level) and inconsistency detection (document level).
MoFE: Mixture of Factual Experts for Controlling Hallucinations in Abstractive Summarization
The Mixture of Factual Experts (MoFE) model, which combines multiple summarization strategies that each target a specific type of factual error, provides a modular approach to control different factual errors while maintaining performance on Rouge metrics.
Seven challenges for harmonizing explainability requirements
It is argued that based on the current understanding of the field, the use of XAI techniques in practice necessitate a highly contextualized approach considering the specific needs of stakeholders for particular business applications.


The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models
The Language Interpretability Tool (LIT), an open-source platform for visualization and understanding of NLP models, is presented, which integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis.
Neural Text Summarization: A Critical Evaluation
This work critically evaluates key ingredients of the current research setup (datasets, evaluation metrics, and models) and highlights three primary shortcomings, among them that automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation.
Earlier Isn’t Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization
While position exhibits substantial bias in news articles, this is not the case with, for example, academic papers and meeting minutes; the empirical study shows that different types of summarization systems reflect the sub-aspects to different degrees.
On Faithfulness and Factuality in Abstractive Summarization
It is found that neural abstractive summarization models are highly prone to hallucinate content that is unfaithful to the input document and textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria.
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
This work proposes pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective, PEGASUS, and demonstrates it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores.
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks and provides formal granular evaluation metrics and identifies areas for future research.
Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization
A novel abstractive model is proposed which is conditioned on the article’s topics and based entirely on convolutional neural networks, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans.
Evaluating the Factual Consistency of Abstractive Text Summarization
A weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary substantially outperforms previous models, including those trained with strong supervision using standard datasets for natural language inference and fact checking.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Content Selection in Deep Learning Models of Summarization
It is suggested that creating a summarizer for a new domain is easier than previous work indicates, which calls into question the benefit of deep learning models for summarization even in domains that do have massive datasets.