ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation

J. Edward Hu, Rachel Rudinger, Matt Post, Benjamin Van Durme
We present PARABANK, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of PARANMT (Wieting and Gimpel, 2018), we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase… 

Paraphrase Generation as Zero-Shot Multilingual Translation: Disentangling Semantic Similarity from Lexical and Syntactic Diversity

A simple paraphrase generation algorithm that discourages the production of n-grams present in the input, and that produces paraphrases which better preserve meaning and are more grammatical at the same level of lexical diversity.

Negative Lexically Constrained Decoding for Paraphrase Generation

A neural model is proposed for paraphrase generation that first identifies the words in the source sentence that should be paraphrased, then paraphrases them with negative lexically constrained decoding, which avoids outputting those words as they are.

Neural Syntactic Preordering for Controlled Paraphrase Generation

This work uses syntactic transformations to softly “reorder” the source sentence and guide the neural paraphrasing model, which retains the quality of the baseline approaches while giving a substantial increase in the diversity of the generated paraphrases.

Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

This work describes ParaBank 2, a new resource containing multiple diverse sentential paraphrases produced from a bilingual corpus using negative constraints, inference sampling, and clustering, and shows that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while preserving meaning.

Multilingual Whispers: Generating Paraphrases with Translation

This paper compares translation-based paraphrase gathering using human, automatic, or hybrid techniques to monolingual paraphrasing by experts and non-experts, and gathers translations, paraphrases, and empirical human quality assessments of these approaches.

ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair

This work generates multiple translation samples using beam search, chooses the most lexically diverse pair according to their sentence BLEU, and compares the generated corpus with ParaBank 2.

BiSECT: Learning to Split and Rephrase Sentences with Bitexts

This work presents a novel dataset and a new model for the ‘split and rephrase’ task. The dataset contains higher-quality training examples than previous Split and Rephrase corpora, and models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.

Automatically Paraphrasing via Sentence Reconstruction and Round-trip Translation

A novel framework for paraphrase generation that simultaneously decodes the output sentence using a pretrained wordset-to-sequence model and a round-trip translation model is proposed and used to augment the training data for machine translation to achieve substantial improvements.

Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing

This work proposes the use of a sequence-to-sequence paraphraser for automatic machine translation evaluation, and finds that the model conditioned on the source instead of the reference outperforms every “quality estimation as a metric” system from the WMT19 shared task on quality estimation by a statistically significant margin in every language pair.

Paraphrase Generation as Unsupervised Machine Translation

In this paper, we propose a new paradigm for paraphrase generation by treating the task as unsupervised machine translation (UMT) based on the assumption that there must be pairs of sentences



ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

This work uses ParaNMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs, to train paraphrastic sentence embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition, in addition to showing how it can be used for paraphrase generation.

Sentential Paraphrasing as Black-Box Machine Translation

This work uses the Paraphrase Database for monolingual sentence rewriting and provides machine translation language packs: prepackaged, tuned models that can be downloaded and used to generate paraphrases on a standard Unix environment.

Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods

A comprehensive and application-independent survey of data-driven phrasal and sentential paraphrase generation methods is conducted, while also conveying an appreciation for the importance and potential use of paraphrases in the field of NLP research.

PPDB: The Paraphrase Database

The 1.0 release of the paraphrase database, PPDB, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations.

Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation

This work presents an algorithm for lexically constrained decoding with a complexity of O(1) in the number of constraints, demonstrates the algorithm’s remarkable ability to properly place constraints, and uses it to explore the shaky relationship between model and BLEU scores.

Extracting Paraphrases from a Parallel Corpus

This work presents an unsupervised learning algorithm for identification of paraphrases from a corpus of multiple English translations of the same source text that yields phrasal and single word lexical paraphrasing as well as syntactic paraphrase.

Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences

A syntax-based algorithm that automatically builds Finite State Automata (word lattices) from semantically equivalent translation sets that are good representations of paraphrases and can predict the correctness of alternative semantic renderings, which may be used to evaluate the quality of translations.

Sockeye: A Toolkit for Neural Machine Translation

This paper highlights Sockeye's features and benchmarks it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): English-German and Latvian-English, and reports competitive BLEU scores across all three architectures.

Efficient Elicitation of Annotations for Human Evaluation of Machine Translation

The experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.

PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification

PPDB 2.0 includes a discriminatively re-ranked set of paraphrases that achieve a higher correlation with human judgments than PPDB 1.0's heuristic rankings.