DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion

@inproceedings{Geva2019DiscoFuseAL,
  title={DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion},
  author={Mor Geva and Eric Malmi and Idan Szpektor and Jonathan Berant},
  booktitle={NAACL},
  year={2019}
}
Sentence fusion is the task of joining several independent sentences into a single coherent text. [...] We apply our approach to two document collections, Wikipedia and sports articles, yielding 60 million fusion examples annotated with the discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DiscoFuse and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic and human evaluation…
Semantically Driven Sentence Fusion: Modeling and Evaluation
TLDR
This work presents an approach in which ground-truth solutions are automatically expanded into multiple references via curated equivalence classes of connective phrases, applies this method to a large-scale dataset, and uses the augmented dataset for both model training and evaluation.
Understanding Points of Correspondence between Sentences for Abstractive Summarization
TLDR
This paper presents an investigation into fusing sentences drawn from a document by introducing the notion of points of correspondence, which are cohesive devices that tie any two sentences together into a coherent text.
Automatic Fact-guided Sentence Modification
TLDR
This paper proposes a two-step solution to rewriting dynamically changing articles in encyclopediae, and demonstrates that generating synthetic data through such rewritten sentences can successfully augment the FEVER fact-checking training dataset, leading to a relative error reduction of 13%.
A Cascade Approach to Neural Abstractive Summarization with Content Selection and Fusion
TLDR
Empirical results are presented showing that the performance of a cascaded pipeline that separately identifies important content pieces and stitches them together into a coherent text is comparable to or outranks that of end-to-end systems, whereas a pipeline architecture allows for flexible content selection.
Encode, Tag, Realize: High-Precision Text Editing
TLDR
LaserTagger, a sequence tagging approach that casts text generation as a text editing task, is proposed, and it is shown that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
An Entity-Driven Framework for Abstractive Summarization
TLDR
SENECA is introduced, a novel System for ENtity-drivEn Coherent Abstractive summarization framework that leverages entity information to generate informative and coherent abstracts and significantly outperforms the previous state of the art on ROUGE and the proposed coherence measures on the New York Times and CNN/Daily Mail datasets.
Analyzing Sentence Fusion in Abstractive Summarization
TLDR
This paper analyzes the outputs of five state-of-the-art abstractive summarizers, focusing on summary sentences that are formed by sentence fusion, and reveals that system sentences are mostly grammatical, but often fail to remain faithful to the original article.
SuperPAL: Supervised Proposition ALignment for Multi-Document Summarization and Derivative Sub-Tasks
TLDR
An annotation methodology is presented for creating gold-standard development and test sets for summary-source alignment, and its utility for tuning and evaluating effective alignment algorithms, as well as for properly evaluating MDS subtasks, is suggested.
Learning to Fuse Sentences with Transformers for Summarization
TLDR
The ability of Transformers to fuse sentences is explored and novel algorithms to enhance their ability to perform sentence fusion by leveraging the knowledge of points of correspondence between sentences are proposed.
Unsupervised Text Style Transfer with Masked Language Models
TLDR
The experiments on sentence fusion and sentiment transfer demonstrate that Masker performs competitively in a fully unsupervised setting, and in low-resource settings, it improves supervised methods' accuracy by over 10 percentage points when pre-training them on silver training data generated by Masker.

References

SHOWING 1-10 OF 28 REFERENCES
Learning to Fuse Disparate Sentences
TLDR
Evaluation by human judges shows that the system for fusing sentences which are drawn from the same source document but have different content produces fused sentences that are both informative and readable.
Time-Efficient Creation of an Accurate Sentence Fusion Corpus
TLDR
This paper presents a methodology for collecting fusions of similar sentence pairs using Amazon's Mechanical Turk, selecting the input pairs in a semi-automated fashion, and evaluates the results using a novel technique for automatically selecting a representative sentence from multiple responses.
Supervised Sentence Fusion with Single-Stage Inference
TLDR
A new dataset of sentence fusion instances obtained from evaluation datasets in summarization shared tasks is presented and a proposed inference approach recovers the highest scoring output fusion under an n-gram factorization using a compact integer linear programming formulation that avoids cycles and disconnected structures.
Sentence Fusion for Multidocument News Summarization
TLDR
This article introduces sentence fusion, a novel text-to-text generation technique for synthesizing common information across documents that moves the summarization field from the use of purely extractive methods to the generation of abstracts that contain sentences not found in any of the input documents and can synthesize information across sources.
Split and Rephrase
TLDR
A new sentence simplification task (Split-and-Rephrase) is proposed, where the aim is to split a complex sentence into a meaning-preserving sequence of shorter sentences; this could be used as a preprocessing step that facilitates and improves the performance of parsers, semantic role labellers and machine translation systems.
Automatic Prediction of Discourse Connectives
TLDR
The hardness of the task for human raters is evaluated and a recently proposed decomposable attention (DA) model is applied; under specific conditions the raters still outperform the DA model, suggesting that there is headroom for future improvements.
Abstractive Multi-Document Summarization via Phrase Selection and Merging
TLDR
This work proposes an abstraction-based multi-document summarization framework that can construct new sentences by exploring more fine-grained syntactic units than sentences, namely, noun/verb phrases, and achieves reasonably good results on manual linguistic quality evaluation.
Extraction Meets Abstraction: Ideal Answer Generation for Biomedical Questions
TLDR
This work incorporates a sentence fusion approach, based on Integer Linear Programming, along with three novel approaches for sentence ordering, in an attempt to improve the human readability of ideal answers for the BioASQ challenge.
Split and Rephrase: Better Evaluation and Stronger Baselines
TLDR
A new train-development-test data split and neural models augmented with a copy-mechanism are presented, outperforming the best reported baseline by 8.68 BLEU and fostering further progress on the task.
Splitting complex sentences for natural language processing applications: Building a simplified Spanish corpus
TLDR
A new Spanish parallel corpus of original and syntactically simplified texts is presented to create an automatic syntactic simplification system to be used as a preprocessing tool for other Natural Language Processing tasks.