Corpus ID: 21703865

Auto-hMDS: Automatic Construction of a Large Heterogeneous Multilingual Multi-Document Summarization Corpus

@inproceedings{Zopf2018AutohMDSAC,
  title={Auto-hMDS: Automatic Construction of a Large Heterogeneous Multilingual Multi-Document Summarization Corpus},
  author={Markus Zopf},
  booktitle={LREC},
  year={2018}
}
Automatic text summarization is a challenging natural language processing (NLP) task which has been researched for several decades. The available datasets for multi-document summarization (MDS) are, however, rather small and usually focused on the newswire genre. Nowadays, machine learning methods are applied to more and more NLP problems such as machine translation, question answering, and single-document summarization. Modern machine learning methods such as neural networks require large…
Summarization Beyond News: The Automatically Acquired Fandom Corpora
Presents a novel automatic corpus construction approach to address the lack of large corpora for training neural networks to create abstractive summaries, along with three new large open-licensed summarization corpora built with this approach that can be used for training abstractive summarization models.
DynE: Dynamic Ensemble Decoding for Multi-Document Summarization
This work proposes a simple decoding methodology which ensembles the output of multiple instances of the same model on different inputs, and obtains state-of-the-art results on several multi-document summarization datasets.
Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model
This work introduces Multi-News, the first large-scale MDS news dataset, and proposes an end-to-end model which combines a traditional extractive summarization model with a standard SDS model, achieving competitive results on MDS datasets.
A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters, and provides a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.
What's Important in a Text? An Extensive Evaluation of Linguistic Annotations for Summarization
This paper extends a previously presented summarization system by replacing bigrams with a multitude of different linguistic annotation types, including n-grams, verb stems, frames, concepts, chunks, connotation frames, entity types, and discourse relation sense-types, and proposes two novel evaluation methods to evaluate information importance detection capabilities.
Error Analysis of using BART for Multi-Document Summarization: A Study for English and German Language
An in-depth error analysis of the approach is performed for both languages, identifying the most notable error types, ranging from made-up facts to topic delimitation, and quantifying the degree of extractiveness.
Principled Approaches to Automatic Text Summarization
It is demonstrated that General Purpose Optimization (GPO) techniques like genetic algorithms are practical and do not require mathematical properties from the objective function and, thus, the summary scoring function can be relieved of its previously imposed constraints.
Massive Multi-Document Summarization of Product Reviews with Weak Supervision
This work proposes a schema for summarizing a massive set of reviews on top of a standard summarization algorithm and shows that an initial implementation of the schema significantly improves over several baselines in ROUGE scores, and exhibits strong coherence in a manual linguistic quality assessment.
WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections
Qualitative analysis shows that the best approaches can generate fluent, high-quality texts but struggle with coherence and factuality, showing the potential of the WIKITABLET dataset to inspire future work on long-form generation.
Which Scores to Predict in Sentence Regression for Text Summarization?
Extensive experiments show that the intuitive choice of target score leads to suboptimal results, and that learning to predict ROUGE precision scores leads to better results.
