CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation

Yuning Mao, Ming Zhong, Jiawei Han
Scientific extreme summarization (TLDR) aims to form ultra-short summaries of scientific papers. Previous efforts to curate scientific TLDR datasets failed to scale up because of the heavy human annotation and domain expertise required. In this paper, we propose a simple yet effective approach to automatically extracting TLDR summaries for scientific papers from their citation texts. Based on the proposed approach, we create a new benchmark, CiteSum, without human annotation, which is around 30 times…
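The citation-guided extraction described in the abstract can be sketched, in highly simplified form, as a heuristic filter over sentences that cite a paper. This is a hypothetical illustration, not the paper's actual pipeline: the function name `extract_tldr_candidates`, the citation-marker regexes, and the length threshold are all assumptions.

```python
import re

def extract_tldr_candidates(citing_sentences, max_words=30):
    """Pick short, self-contained citation sentences as TLDR candidates.

    Simplified sketch: strip inline citation markers, then keep
    sentences short enough to serve as extreme summaries.
    """
    candidates = []
    for sent in citing_sentences:
        # Remove bracketed or parenthesized citation markers,
        # e.g. "[12]" or "(Mao et al., 2022)".
        cleaned = re.sub(r"\[[\d,\s]+\]|\([A-Z][^)]*\d{4}[^)]*\)", "", sent)
        # Collapse the whitespace left behind by marker removal.
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        if 0 < len(cleaned.split()) <= max_words:
            candidates.append(cleaned)
    return candidates
```

For example, a citing sentence such as "CiteSum [12] builds TLDRs from citation texts." would be kept (minus the marker), while a long discussion sentence would be filtered out.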



PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
This work proposes pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective, PEGASUS, and demonstrates it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores.
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks. However, these models are typically fine-tuned on
TLDR: Extreme Summarization of Scientific Documents
This work introduces SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers, and proposes CATTS, a simple yet effective learning strategy for generating TLDRs that exploits titles as an auxiliary training signal.
ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks
The first large-scale manually-annotated corpus for scientific papers is developed and released, enabled by faster annotation, and summarization methods that integrate the authors' original highlights and the article's actual impacts on the community are proposed to create comprehensive, hybrid summaries.
TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising
This work first leverages the lead bias in news articles to pretrain the model on millions of unlabeled documents, then finetunes TED on target domains through theme modeling and a denoising autoencoder to enhance the quality of generated summaries.
Overview of the CL-SciSumm 2016 Shared Task
This overview paper describes the participation and the official results of the second CL-SciSumm Shared Task, organized as part of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016), held in New Jersey, USA in June 2016.
Leveraging Lead Bias for Zero-shot Abstractive News Summarization
This work proposes a simple and effective way to pre-train abstractive news summarization models on large-scale unlabeled news corpora: predicting the leading sentences from the rest of an article via self-supervised pre-training.
Facet-Aware Evaluation for Extractive Summarization
This paper demonstrates that facet-aware evaluation manifests better correlation with human judgment than ROUGE, enables fine-grained evaluation as well as comparative analysis, and reveals valuable insights of state-of-the-art summarization methods.
Sentence Centrality Revisited for Unsupervised Summarization
An unsupervised approach is developed, arguing that it is unrealistic to expect large-scale, high-quality training data to be available or created for different types of summaries, domains, or languages.
Extractive Summarization as Text Matching
This paper formulates the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries are matched in a semantic space, and proposes a semantic matching framework accordingly.