ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation
@article{Hu2019ParaBankMB, title={ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation}, author={J. Edward Hu and Rachel Rudinger and Matt Post and Benjamin Van Durme}, journal={ArXiv}, year={2019}, volume={abs/1901.03644} }
We present PARABANK, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of PARANMT (Wieting and Gimpel, 2018), we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase…
49 Citations
Paraphrase Generation as Zero-Shot Multilingual Translation: Disentangling Semantic Similarity from Lexical and Syntactic Diversity
- Computer Science, WMT
- 2020
A simple paraphrase generation algorithm that discourages the production of n-grams present in the input and produces paraphrases that better preserve meaning and are more grammatical for the same level of lexical diversity.
Negative Lexically Constrained Decoding for Paraphrase Generation
- Computer Science, ACL
- 2019
A neural model is proposed for paraphrase generation that first identifies the words in the source sentence that should be paraphrased and then paraphrases them using negative lexically constrained decoding, which avoids outputting these words as they are.
Neural Syntactic Preordering for Controlled Paraphrase Generation
- Computer Science, ACL
- 2020
This work uses syntactic transformations to softly “reorder” the source sentence and guide the neural paraphrasing model, which retains the quality of the baseline approaches while giving a substantial increase in the diversity of the generated paraphrases.
Multilingual Whispers: Generating Paraphrases with Translation
- Computer Science, EMNLP
- 2019
This paper compares translation-based paraphrase gathering using human, automatic, or hybrid techniques to monolingual paraphrasing by experts and non-experts, and gathers translations, paraphrases, and empirical human quality assessments of these approaches.
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
- Computer Science, PACLIC
- 2021
This work generates multiple translation samples using beam search, chooses the most lexically diverse pair according to their sentence BLEU, and compares the generated corpus with ParaBank2.
Exemplar-Controllable Paraphrasing and Translation using Bitext
- Computer Science
- 2020
This work adapts models from prior work to be able to learn solely from bilingual text (bitext), and shows that their models learn to disentangle semantics and syntax in their latent representations, but still suffer from semantic drift.
BiSECT: Learning to Split and Rephrase Sentences with Bitexts
- Computer Science, EMNLP
- 2021
A novel dataset and a new model for this ‘split and rephrase’ task, which contains higher quality training examples than the previous Split and Rephrase corpora, and shows that models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.
Automatically Paraphrasing via Sentence Reconstruction and Round-trip Translation
- Computer Science, IJCAI
- 2021
A novel framework for paraphrase generation that simultaneously decodes the output sentence using a pretrained wordset-to-sequence model and a round-trip translation model is proposed and used to augment the training data for machine translation to achieve substantial improvements.
Controllable Paraphrasing and Translation with a Syntactic Exemplar
- Computer Science, ArXiv
- 2020
The authors' single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions, and analysis shows that their models learn to disentangle semantics and syntax in their latent representations.
Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing
- Computer Science, EMNLP
- 2020
This work proposes the use of a sequence-to-sequence paraphraser for automatic machine translation evaluation, and finds that the model conditioned on the source instead of the reference outperforms every "quality estimation as a metric" system from the WMT19 shared task on quality estimation by a statistically significant margin in every language pair.
References
ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
- Computer Science, ACL
- 2018
This work uses ParaNMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs, to train paraphrastic sentence embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition, in addition to showing how it can be used for paraphrase generation.
Sentential Paraphrasing as Black-Box Machine Translation
- Computer Science, NAACL
- 2016
This work uses the Paraphrase Database for monolingual sentence rewriting and provides machine translation language packs: prepackaged, tuned models that can be downloaded and used to generate paraphrases on a standard Unix environment.
Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods
- Computer Science, Computational Linguistics
- 2010
A comprehensive and application-independent survey of data-driven phrasal and sentential paraphrase generation methods is conducted, while also conveying an appreciation for the importance and potential use of paraphrases in the field of NLP research.
PPDB: The Paraphrase Database
- Computer Science, NAACL
- 2013
The 1.0 release of the paraphrase database, PPDB, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations.
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
- Computer Science, NAACL
- 2018
This work presents an algorithm for lexically constrained decoding with a complexity of O(1) in the number of constraints, demonstrates the algorithm's remarkable ability to properly place constraints, and uses it to explore the shaky relationship between model and BLEU scores.
Extracting Paraphrases from a Parallel Corpus
- Linguistics, ACL
- 2001
This work presents an unsupervised learning algorithm for identifying paraphrases from a corpus of multiple English translations of the same source text; it yields phrasal and single-word lexical paraphrases as well as syntactic paraphrases.
Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences
- Computer Science, NAACL
- 2003
A syntax-based algorithm that automatically builds Finite State Automata (word lattices) from semantically equivalent translation sets that are good representations of paraphrases and can predict the correctness of alternative semantic renderings, which may be used to evaluate the quality of translations.
Sockeye: A Toolkit for Neural Machine Translation
- Computer Science, ArXiv
- 2017
This paper highlights Sockeye's features and benchmarks it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): English-German and Latvian-English, and reports competitive BLEU scores across all three architectures.
Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
- Computer Science, WMT@ACL
- 2014
The experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.
PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification
- Computer Science, ACL
- 2015
PPDB 2.0 includes a discriminatively re-ranked set of paraphrases that achieve a higher correlation with human judgments than PPDB 1.0's heuristic rankings.