Transforming Wikipedia into Augmented Data for Query-Focused Summarization

@article{Zhu2022TransformingWI,
  title={Transforming Wikipedia into Augmented Data for Query-Focused Summarization},
  author={Haichao Zhu and Li Dong and Furu Wei and Bing Qin and Ting Liu},
  journal={ArXiv},
  year={2022},
  volume={abs/1911.03324}
}
The manual construction of a query-focused summarization corpus is costly and time-consuming. The limited size of existing datasets makes it challenging to train data-driven summarization models. In this paper, we use Wikipedia to automatically collect a large query-focused summarization dataset (named WIKIREF) of more than 280,000 examples, which can serve as a means of data augmentation. Moreover, we develop a query-focused summarization model based on BERT to extract summaries from the…
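The abstract suggests a citation-driven recipe for harvesting examples: a cited Wikipedia statement acts as the summary, the article and section titles act as the query, and the body of the cited source acts as the document. A minimal sketch under that reading follows; iter_cited_statements and fetch_cited_source are hypothetical helpers, and the paper's actual filtering steps are omitted.

```python
# A minimal sketch of WIKIREF-style example construction, assuming the
# recipe suggested by the abstract (statement = summary, titles = query,
# cited source = document). iter_cited_statements() and
# fetch_cited_source() are hypothetical helpers, not the paper's code.

from dataclasses import dataclass

@dataclass
class QfsExample:
    query: str     # article title plus section titles
    document: str  # text of the cited reference
    summary: str   # the Wikipedia statement backed by the citation

def build_examples(article):
    examples = []
    for statement, citation_url, section_path in iter_cited_statements(article):
        source_text = fetch_cited_source(citation_url)  # hypothetical fetcher
        if source_text is None:
            continue
        query = " ".join([article.title, *section_path])
        examples.append(QfsExample(query=query, document=source_text,
                                   summary=statement))
    return examples
```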

Citations

Document Summarization with Latent Queries
This framework formulates summarization as a generative process, jointly optimizing a latent query model and a conditional language model, and outperforms strong comparison systems across benchmarks, query types, document settings, and target domains.
Text Summarization with Latent Queries
LAQSUM is the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms under a deep generative framework, allowing users to plug-and-play queries of any type at test time.
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks. However, these models are typically fine-tuned on…
Harvesting and Refining Question-Answer Pairs for Unsupervised QA
This work introduces two approaches to improve unsupervised QA: harvesting lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs, and taking advantage of the QA model to extract more appropriate answers.
QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization
This work defines a new query-based multi-domain meeting summarization task, where models have to select and summarize relevant spans of meetings in response to a query, and introduces QMSum, a new benchmark for this task.
Submodular Span, with Applications to Conditional Data Summarization
A two-stage Submodular Span Summarization (S3) framework is proposed to achieve a form of conditional or query-focused data summarization, matching or improving over the previous state of the art.
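The S3 objective itself is not reproduced here; as background, frameworks of this kind build on the standard greedy algorithm for monotone submodular maximization under a budget (with its classic 1 − 1/e guarantee). The query-weighted facility-location objective below is an illustrative stand-in, not the paper's objective.

```python
# Standard greedy selection for a monotone submodular objective under a
# cardinality budget. The coverage() objective is a query-weighted
# facility-location function, used purely for illustration.

def coverage(selected, sim, relevance):
    # sim[i][j]: similarity between sentences i and j;
    # relevance[j]: similarity of sentence j to the query.
    return sum(relevance[j] * max((sim[i][j] for i in selected), default=0.0)
               for j in range(len(relevance)))

def greedy_select(sim, relevance, budget):
    selected, remaining = [], set(range(len(relevance)))
    while remaining and len(selected) < budget:
        best = max(remaining,
                   key=lambda i: coverage(selected + [i], sim, relevance))
        selected.append(best)
        remaining.remove(best)
    return selected
```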
A Survey of Data Augmentation Approaches for NLP
This paper introduces and motivates data augmentation for NLP, discusses major methodologically representative approaches, and highlights techniques used for popular NLP applications and tasks.
Surfer100: Generating Surveys From Web Resources, Wikipedia-style
This study shows that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation, and extends this approach to generate longer Wikipedia-style summaries with sections.
Intelligent Data Engineering and Automated Learning – IDEAL 2020: 21st International Conference, Guimaraes, Portugal, November 4–6, 2020, Proceedings, Part I
A prostate gland segmentation based on U-Net convolutional neural network architectures modified with residual and multi-resolution blocks, trained using data augmentation techniques, which outperforms the previous state-of-the-art approaches in an image-level comparison.
…

References

Showing 1-10 of 21 references
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
A novel summarization system called AttSum is proposed, which automatically learns distributed representations for sentences as well as the document cluster and applies the attention mechanism to simulate the attentive reading of human behavior when a query is given.
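Schematically, the query-attention pooling named in this summary can be sketched as follows; this illustrates the general mechanism (attention weights from sentence-query affinity, then ranking against the pooled cluster representation), not AttSum's exact architecture.

```python
# Illustrative query-attention pooling in the spirit of AttSum: pool
# sentence embeddings with attention weights derived from the query,
# then rank sentences against the pooled cluster representation.
# A schematic of the mechanism, not the exact model.

import numpy as np

def attend_and_rank(sent_embs, query_emb):
    # attention weights: softmax over sentence-query dot products
    logits = sent_embs @ query_emb
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    cluster = weights @ sent_embs  # attention-pooled cluster embedding
    # rank sentences by cosine similarity to the cluster representation
    sims = (sent_embs @ cluster) / (
        np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(cluster) + 1e-8)
    return np.argsort(-sims)
```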
Query Focused Abstractive Summarization: Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints into seq2seq Models
The method (Relevance Sensitive Attention for QFS) is compared to extractive baselines and to various ways of combining abstractive models on the DUC QFS datasets, showing solid improvements in ROUGE performance.
Unsupervised Query-Focused Multi-Document Summarization using the Cross Entropy Method
A novel unsupervised query-focused multi-document summarization approach that generates a summary by extracting a subset of sentences using the Cross-Entropy (CE) Method is presented.
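As a grounding illustration, the Cross-Entropy method for subset selection fits in a few lines: sample subsets from per-sentence inclusion probabilities, keep the elite fraction under a quality function, and move the probabilities toward the elite samples. The quality function here is a caller-supplied stand-in for the paper's unsupervised, query-aware objective.

```python
# A minimal sketch of the Cross-Entropy method for sentence-subset
# selection. quality() maps a boolean inclusion mask to a score and is
# an illustrative stand-in for the paper's actual objective.

import random

def ce_select(n_sentences, quality, iters=50, samples=200,
              elite_frac=0.05, alpha=0.7):
    p = [0.5] * n_sentences  # per-sentence inclusion probabilities
    for _ in range(iters):
        pool = [[random.random() < p[i] for i in range(n_sentences)]
                for _ in range(samples)]
        pool.sort(key=quality, reverse=True)  # best subsets first
        elite = pool[:max(1, int(elite_frac * samples))]
        for i in range(n_sentences):
            freq = sum(mask[i] for mask in elite) / len(elite)
            p[i] = alpha * freq + (1 - alpha) * p[i]  # smoothed update
    return [i for i in range(n_sentences) if p[i] > 0.5]
```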
Applying regression models to query-focused multi-document summarization
Neural Document Summarization by Jointly Learning to Score and Select Sentences
This paper presents a novel end-to-end neural network framework for extractive document summarization by jointly learning to score and select sentences, which significantly outperforms the state-of-the-art extractive summarization models.
Using Supervised Bigram-based ILP for Extractive Summarization
A bigram-based supervised method for extractive document summarization in the integer linear programming (ILP) framework that consistently outperforms the previous ILP method on different TAC datasets and performs competitively compared to the best results in the TAC evaluations.
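The concept-based ILP this line of work builds on maximizes the total weight of covered bigrams under a length budget, with consistency constraints linking bigram and sentence variables. A sketch follows using the PuLP solver; the bigram weights would come from the paper's supervised model and must be supplied by the caller.

```python
# Sketch of concept-based ILP extractive summarization: maximize the
# weight of covered bigrams under a word budget. Requires `pip install pulp`;
# bigram_weight is caller-supplied (learned in the paper, not here).

import pulp

def ilp_summarize(sentences, sent_bigrams, bigram_weight, max_words):
    prob = pulp.LpProblem("summarization", pulp.LpMaximize)
    bigrams = sorted({b for bs in sent_bigrams for b in bs})
    s = [pulp.LpVariable(f"s{i}", cat="Binary") for i in range(len(sentences))]
    c = {b: pulp.LpVariable(f"c{k}", cat="Binary") for k, b in enumerate(bigrams)}

    prob += pulp.lpSum(bigram_weight[b] * c[b] for b in bigrams)     # objective
    prob += pulp.lpSum(len(sentences[i].split()) * s[i]
                       for i in range(len(sentences))) <= max_words  # budget
    for i, bs in enumerate(sent_bigrams):
        for b in bs:
            prob += s[i] <= c[b]  # a selected sentence covers its bigrams
    for b in bigrams:
        prob += c[b] <= pulp.lpSum(s[i] for i, bs in enumerate(sent_bigrams)
                                   if b in bs)  # covered only via a selected sentence

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [sentences[i] for i in range(len(sentences)) if s[i].value() == 1]
```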
The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries
A method for combining query-relevance with information-novelty in the context of text retrieval and summarization is presented; preliminary results indicate some benefits for MMR diversity ranking in document retrieval and in single-document summarization.
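The MMR criterion is compact enough to state directly: at each step, pick the candidate maximizing lambda * Sim(candidate, query) - (1 - lambda) * max Sim(candidate, already-selected). A minimal sketch follows, with sim_query and sim_pair left as caller-supplied similarity functions (e.g., cosine over TF-IDF vectors).

```python
# Maximal Marginal Relevance (MMR): greedily trade off query relevance
# against redundancy with already-selected items.

def mmr_rank(candidates, sim_query, sim_pair, lam=0.7, k=5):
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((sim_pair(i, j) for j in selected), default=0.0)
            return lam * sim_query(i) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```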
Overview of DUC 2005
The focus of DUC 2005 was on developing new evaluation methods that take into account variation in content in human-authored summaries. Therefore, DUC 2005 had a single user-oriented, question-focused summarization task.
Fine-tune BERT for Extractive Summarization
BERTSUM, a simple variant of BERT for extractive summarization, is described; it achieves state of the art on the CNN/DailyMail dataset, outperforming the previous best-performing system by 1.65 on ROUGE-L.
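A heavily simplified sketch of BERT-based extractive scoring follows, assuming the HuggingFace transformers API. The real BERTSUM encodes the whole document with one [CLS] token per sentence plus interval segment embeddings, and its classifier is fine-tuned; encoding sentences independently with an untrained head, as here, is a simplification for shape only.

```python
# Simplified extractive scorer in the spirit of BERTSUM: score each
# sentence via its [CLS] embedding and a linear head, then keep the
# top-scoring sentences. The head below is untrained (illustration only).

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
scorer = torch.nn.Linear(encoder.config.hidden_size, 1)  # would be fine-tuned

def score_sentences(sentences):
    scores = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt", truncation=True)
            cls_vec = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] embedding
            scores.append(torch.sigmoid(scorer(cls_vec)).item())
    return scores
```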
…