Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

@inproceedings{Grusky2018NewsroomAD,
  title={Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies},
  author={Max Grusky and Mor Naaman and Yoav Artzi},
  booktitle={NAACL},
  year={2018}
}
We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major news publications. We analyze the extraction strategies used in NEWSROOM summaries against other datasets to quantify the diversity and difficulty of our new data, and train existing methods on the data to evaluate its utility and challenges. The dataset is available online at summari.es.
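The paper's analysis of extraction strategies is built on extractive fragments: greedily matched spans of summary text that appear verbatim in the article, from which coverage and density statistics are derived. Below is a minimal sketch of that greedy matching idea; the function names are illustrative, not taken from the paper's released code.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily match each summary position to the longest token
    sequence shared with the article (in the spirit of Grusky et al.)."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            if article_tokens[j] != summary_tokens[i]:
                continue
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(best)  # record the fragment's length
            i += best               # skip past the matched span
        else:
            i += 1                  # unmatched token: no fragment
    return fragments

def coverage(fragments, summary_len):
    # fraction of summary tokens that lie inside some extractive fragment
    return sum(fragments) / summary_len

def density(fragments, summary_len):
    # average squared fragment length: rewards long verbatim copies
    return sum(f * f for f in fragments) / summary_len
```

A purely extractive summary has coverage near 1 and high density; an abstractive one scores low on both.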


Citations

BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization
TLDR
This work presents a novel dataset, BIGPATENT, consisting of 1.3 million records of U.S. patent documents along with human-written abstractive summaries, which has the following properties: i) summaries contain a richer discourse structure with more recurring entities, ii) salient content is evenly distributed in the input, and iii) fewer and shorter extractive fragments are present in the summaries.
Automatically Discarding Straplines to Improve Data Quality for Abstractive News Summarization
TLDR
Automatic evaluation indicates that removing straplines and noise from the training data of a news summarizer results in higher-quality summaries, with improvements as high as 7 ROUGE points.
Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model
TLDR
This work introduces Multi-News, the first large-scale MDS news dataset, and proposes an end-to-end model which incorporates a traditional extractive summarization model with a standard SDS model and achieves competitive results on MDS datasets.
WikiHow: A Large Scale Text Summarization Dataset
TLDR
This paper presents WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors, representing a high diversity of styles.
Using Statistical Weighting and Popularity Ranking for Extractive Summarization
TLDR
The most novel method combines the TextRank algorithm with a statistically weighted distribution; the resulting system highlights which parts of the text are best suited for extractive summarization.
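As a rough illustration of the combination described above, the sketch below blends TextRank-style PageRank scores over a word-overlap sentence graph with a simple frequency-based statistical weight. The blending scheme and the particular weighting are assumptions for illustration, not the cited paper's exact method.

```python
import math
from collections import Counter

def textrank_scores(sentences, d=0.85, iters=50):
    """Rank sentences by PageRank over a word-overlap similarity
    graph, in the spirit of TextRank (Mihalcea & Tarau, 2004)."""
    toks = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # normalize overlap by sentence lengths (log-scaled)
                norm = math.log(len(toks[i]) + 1) + math.log(len(toks[j]) + 1)
                sim[i][j] = len(toks[i] & toks[j]) / norm
    row = [sum(r) for r in sim]
    scores = [1.0] * n
    for _ in range(iters):  # power iteration with damping factor d
        scores = [
            (1 - d) + d * sum(sim[j][i] / row[j] * scores[j]
                              for j in range(n) if row[j] > 0)
            for i in range(n)
        ]
    return scores

def summarize(sentences, k=1, alpha=0.5):
    """Blend TextRank with a statistical weight: here, the average
    corpus frequency of a sentence's words (an assumed weighting)."""
    freq = Counter(w for s in sentences for w in s.lower().split())
    tr = textrank_scores(sentences)
    stat = [sum(freq[w] for w in s.lower().split()) / len(s.split())
            for s in sentences]
    blended = [alpha * a + (1 - alpha) * b for a, b in zip(tr, stat)]
    top = sorted(range(len(sentences)), key=lambda i: -blended[i])[:k]
    return [sentences[i] for i in sorted(top)]  # preserve document order
```

The `alpha` parameter trades off graph centrality against raw word frequency; both signals favor sentences that share vocabulary with the rest of the document.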
MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization
TLDR
This paper presents a new dataset, MIRANEWS, aimed at generating a summary for a single document with the help of assisting documents, and shows via data analysis that it is not only the models that are to blame: more than 27% of facts mentioned in the gold summaries of MIRANEWS are better grounded in the assisting documents than in the main source articles.
CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level
TLDR
A large-scale Chinese news summarization dataset, CNewSum, is presented, consisting of 304,307 documents with human-written summaries for the news feed; its long documents and highly abstractive summaries can encourage document-level understanding and generation in current summarization models.
A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal
TLDR
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters, and provides a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.
WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation
TLDR
This work presents a dataset based on article summaries appearing on the WikiHow website, composed of how-to articles and coherent-paragraph summaries written in plain language, showing this dataset makes human evaluation significantly easier and thus, more effective.
Large-Scale Multi-Document Summarization with Information Extraction and Compression
TLDR
This work develops an abstractive summarization framework for multiple heterogeneous documents that is independent of labeled data, and enhances an existing sentence fusion method with a uni-directional language model that prioritizes fused sentences with higher sentence probability, with the goal of increasing readability.

References

Showing 1-10 of 36 references
Neural Summarization by Extracting Sentences and Words
TLDR
This work develops a general framework for single-document summarization composed of a hierarchical document encoder and an attention-based extractor that allows for different classes of summarization models which can extract sentences or words.
SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents
We present SummaRuNNer, a Recurrent Neural Network (RNN) based sequence model for extractive summarization of documents, and show that it achieves performance better than or comparable to the state of the art.
Classify or Select: Neural Architectures for Extractive Document Summarization
TLDR
Two novel and contrasting Recurrent Neural Network (RNN) based architectures for extractive summarization of documents are presented and the models under both architectures jointly capture the notions of salience and redundancy of sentences.
A Neural Attention Model for Abstractive Sentence Summarization
TLDR
This work proposes a fully data-driven approach to abstractive sentence summarization by utilizing a local attention-based model that generates each word of the summary conditioned on the input sentence.
Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
TLDR
This work proposes several novel models that address critical problems in summarization that are not adequately modeled by the basic architecture, such as modeling key-words, capturing the hierarchy of sentence-to-word structure, and emitting words that are rare or unseen at training time.
Abstractive Document Summarization with a Graph-Based Attentional Neural Model
TLDR
A novel graph-based attention mechanism in the sequence-to-sequence framework addresses the saliency factor of summarization, which has been overlooked by prior works; the resulting model is competitive with state-of-the-art extractive methods.
The Effects of Human Variation in DUC Summarization Evaluation
TLDR
This work examines how variation in human judgments does and does not affect the results, and the interpretation of those results, in evaluations of automatic text summarization systems' output.
Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?
TLDR
The validity of the evaluation method used in the Document Understanding Conference (DUC) is discussed, and five different ROUGE metrics from the ROUGE summarization evaluation package are evaluated on DUC data: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU.
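For reference, ROUGE-N is the simplest of the metrics listed above: n-gram overlap recall between a candidate summary and a reference summary (Lin, 2004). A minimal sketch of the recall variant:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped overlapping n-gram count divided by
    the reference's total n-gram count."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    if not ref:
        return 0.0
    # clip each n-gram's count to its count in the candidate
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    return overlap / sum(ref.values())
```

ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU replace the contiguous n-gram match with longest common subsequences, weighted LCS, and skip-bigrams, respectively.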
Improving the Estimation of Word Importance for News Multi-Document Summarization
TLDR
A supervised model for ranking word importance that incorporates a rich set of features is proposed and shown to be superior to prior approaches for identifying words used in human summaries; an extractive summarizer that includes this word-importance estimate produces summaries comparable with the state of the art under automatic evaluation.
DUC 2005: Evaluation of Question-Focused Summarization Systems
TLDR
The evaluation shows that the best summarization systems have difficulty extracting relevant sentences in response to complex questions (as opposed to representative sentences that might be appropriate to a generic summary).