Corpus ID: 215754200

A Divide-and-Conquer Approach to the Summarization of Academic Articles

@article{Gidiotis2020ADA,
  title={A Divide-and-Conquer Approach to the Summarization of Academic Articles},
  author={Alexios Gidiotis and Grigorios Tsoumakas},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.06190}
}
We present a novel divide-and-conquer method for the summarization of long documents. Our method processes the input in parts and generates a corresponding summary for each part. These partial summaries are then combined in order to produce a final complete summary. Splitting the problem of long document summarization into smaller and simpler problems reduces the computational complexity of the summarization process and leads to more training examples that at the same time contain less noise in the target…
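The abstract describes a three-step pipeline: split the document, summarize each part, and join the partial summaries. Below is a minimal sketch of that idea, assuming a generic `summarize` callable (any single-document summarizer) and a hypothetical `split_into_sections` helper; the paper splits along the article's own structure, so the word-count split here is illustration only.

```python
from typing import Callable, List

def split_into_sections(document: str, max_words: int = 500) -> List[str]:
    """Hypothetical splitter: break a long document into word-bounded chunks.
    The paper splits along the article's own structure; a plain word-count
    split is used here purely for illustration."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def divide_and_conquer_summary(document: str,
                               summarize: Callable[[str], str],
                               max_words: int = 500) -> str:
    """Summarize each part independently, then combine the partial
    summaries into one complete summary."""
    parts = split_into_sections(document, max_words)
    partial_summaries = [summarize(part) for part in parts]
    return " ".join(partial_summaries)

# Usage with any summarizer, e.g. a pre-trained seq2seq model wrapped as a function:
# summary = divide_and_conquer_summary(long_article, my_summarizer)
```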


Topic-Focused Extractive Summarization
TLDR
A new BERT-based neural model is proposed to learn the task of topic-focused extractive summarization, and a system is built that, after being trained on a small number of document-summary pairs per domain, can generate topic-focused summaries for unseen documents in that domain according to the user’s requirements.
On Generating Extended Summaries of Long Documents
TLDR
This paper exploits the hierarchical structure of documents and incorporates it into an extractive summarization model through a multi-task learning approach, and shows that the multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences across diverse sections.
Long-Span Dependencies in Transformer-based Summarization Systems
TLDR
This work exploits large pre-trained transformer-based models and addresses long-span dependencies in abstractive summarization through two methods, local self-attention and explicit content selection, achieving results comparable to or better than existing approaches.
Human-Centered Financial Summarization
TLDR
This thesis proposes a human-centered financial summarization model built on an existing state-of-the-art transformer model, PEGASUS, and contributes a new dataset consisting of financial newswire articles from the economic and business categories of Bloomberg’s website.
CIST@CL-SciSumm 2020, LongSumm 2020: Automatic Scientific Document Summarization
TLDR
This work applies additional machine learning methods to position and content features for facet classification in Task 1B of CL-SciSumm 2020, and introduces a GCN in Task 2 to perform extractive summarization.
Overview and Insights from the Shared Tasks at Scholarly Document Processing 2020: CL-SciSumm, LaySumm and LongSumm
TLDR
The quality and quantity of the submissions show that there is ample interest in scholarly document summarization, and the state of the art in this domain is at a midway point between being an impossible task and one that is fully resolved.
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information
TLDR
This work proposes a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model based on a pre-trained, encoder-only Transformer language model (HiStruct+), substantially improving state-of-the-art ROUGE scores for extractive summarization on PubMed and arXiv.
Longformer: The Long-Document Transformer
TLDR
Following prior work on long-sequence transformers, the Longformer is evaluated on character-level language modeling, achieving state-of-the-art results on text8 and enwik8; Longformer is then pretrained and finetuned on a variety of downstream tasks.
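Longformer's tractability on long inputs comes from replacing full quadratic self-attention with a sliding local window plus a few globally attending positions. The snippet below is a rough NumPy sketch of that attention pattern, not the actual Longformer implementation; the window size and sequence length are illustrative.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int, global_positions=()) -> np.ndarray:
    """Boolean attention mask: each token attends only to neighbours within
    `window` positions on either side; tokens in `global_positions` attend
    to, and are attended by, every position."""
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # local band
    for g in global_positions:
        mask[g, :] = True   # global token attends everywhere
        mask[:, g] = True   # every token attends to the global token
    return mask

mask = sliding_window_mask(seq_len=1024, window=128, global_positions=(0,))
print(mask.mean())  # fraction of token pairs attended, versus 1.0 for full attention
```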
Big Bird: Transformers for Longer Sequences
TLDR
It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
We Can Explain Your Research in Layman's Terms: Towards Automating Science Journalism at Scale
TLDR
This work creates a specialized dataset that contains scientific papers and their Science Daily press releases, and demonstrates numerous sequence-to-sequence (seq2seq) applications using Science Daily with the aim of facilitating further research on language generation.
...

References

Showing 1-10 of 37 references
A Supervised Approach to Extractive Summarisation of Scientific Papers
TLDR
This paper introduces a new dataset for summarisation of computer science publications by exploiting a large resource of author-provided summaries, and develops models on the dataset making use of both neural sentence encoding and traditionally used summarisation features.
Bottom-Up Abstractive Summarization
TLDR
This work explores the use of data-efficient content selectors to over-determine phrases in a source document that should be part of the summary, and shows that this approach improves the ability to compress text, while still generating fluent summaries.
Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
TLDR
This work proposes several novel models that address critical problems in summarization that are not adequately modeled by the basic architecture, such as modeling keywords, capturing the hierarchy of sentence-to-word structure, and emitting words that are rare or unseen at training time.
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
TLDR
This work proposes the first model for abstractive summarization of single, longer-form documents (e.g., research papers), consisting of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary.
Get To The Point: Summarization with Pointer-Generator Networks
TLDR
A novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways: a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information while retaining the ability to produce novel words through the generator, and a coverage mechanism that keeps track of what has been summarized, discouraging repetition.
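At each decoding step the pointer-generator computes a soft switch p_gen and mixes the vocabulary distribution with a copy distribution read off the attention weights. A minimal PyTorch sketch of that final distribution is given below; tensor names and shapes are illustrative, and extended-vocabulary handling of out-of-vocabulary source words is omitted.

```python
import torch

def final_distribution(p_gen: torch.Tensor,         # (batch, 1) generation probability
                       vocab_dist: torch.Tensor,    # (batch, vocab_size) softmax over the vocabulary
                       attention: torch.Tensor,     # (batch, src_len) attention over source tokens
                       src_token_ids: torch.Tensor  # (batch, src_len) source ids within the vocabulary
                       ) -> torch.Tensor:
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on source
    positions where w occurs); the result still sums to 1 per example."""
    gen_part = p_gen * vocab_dist
    copy_part = torch.zeros_like(vocab_dist)
    copy_part.scatter_add_(1, src_token_ids, (1.0 - p_gen) * attention)
    return gen_part + copy_part
```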
A Neural Attention Model for Abstractive Sentence Summarization
TLDR
This work proposes a fully data-driven approach to abstractive sentence summarization by utilizing a local attention-based model that generates each word of the summary conditioned on the input sentence.
Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
TLDR
The NEWSROOM dataset is presented, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications between 1998 and 2017, and the summaries combine abstractive and extractive strategies.
Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion
Text Summarization with Pretrained Encoders
TLDR
This paper introduces a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences and proposes a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two.
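The BERT-based extractive setup described above inserts a [CLS] token before each sentence, encodes the whole document once, and scores each sentence from its [CLS] representation. The sketch below illustrates that scoring head with the Hugging Face `transformers` library; it is a simplification (no interval segment embeddings, no fine-tuning schedule) and the linear scorer here is untrained.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
scorer = torch.nn.Linear(encoder.config.hidden_size, 1)  # would be fine-tuned on extractive labels

def score_sentences(sentences):
    """Prefix each sentence with [CLS], encode the document once, and read
    one salience score per sentence off its [CLS] position."""
    text = " ".join("[CLS] " + s + " [SEP]" for s in sentences)
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       max_length=512, add_special_tokens=False)
    ids = inputs["input_ids"][0]
    cls_positions = (ids == tokenizer.cls_token_id).nonzero(as_tuple=True)[0]
    hidden = encoder(**inputs).last_hidden_state[0]                   # (seq_len, hidden_size)
    return torch.sigmoid(scorer(hidden[cls_positions])).squeeze(-1)   # one score per sentence

# scores = score_sentences(["First sentence.", "Second sentence.", "Third sentence."])
```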
Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting
TLDR
An accurate and fast summarization model that first selects salient sentences and then rewrites them abstractively to generate a concise overall summary is proposed, which achieves the new state-of-the-art on all metrics on the CNN/Daily Mail dataset, as well as significantly higher abstractiveness scores.
...