ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation

@article{Hu2019ParaBankMB,
  title={ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation},
  author={J. Edward Hu and Rachel Rudinger and Matt Post and Benjamin Van Durme},
  journal={ArXiv},
  year={2019},
  volume={abs/1901.03644}
}
We present ParaBank, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of ParaNMT (Wieting and Gimpel, 2018), we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase…

Citations

Paraphrase Generation as Zero-Shot Multilingual Translation: Disentangling Semantic Similarity from Lexical and Syntactic Diversity
TLDR
A simple paraphrase generation algorithm which discourages the production of n-grams that are present in the input, and which produces paraphrases that better preserve meaning and are more grammatical for the same level of lexical diversity.
Negative Lexically Constrained Decoding for Paraphrase Generation
TLDR
A neural model is proposed for paraphrase generation that first identifies words in the source sentence that should be paraphrased; these words are then rewritten via negative lexically constrained decoding, which avoids outputting them as they are.
Neural Syntactic Preordering for Controlled Paraphrase Generation
TLDR
This work uses syntactic transformations to softly “reorder” the source sentence and guide the neural paraphrasing model, which retains the quality of the baseline approaches while giving a substantial increase in the diversity of the generated paraphrases.
Multilingual Whispers: Generating Paraphrases with Translation
TLDR
This paper compares translation-based paraphrase gathering using human, automatic, or hybrid techniques to monolingual paraphrasing by experts and non-experts, and gathers translations, paraphrases, and empirical human quality assessments of these approaches.
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
TLDR
This work generates multiple translation samples using beam search and chooses the most lexically diverse pair according to their sentence BLEU, comparing the generated corpus with ParaBank2.
Exemplar-Controllable Paraphrasing and Translation using Bitext
TLDR
This work adapts models from prior work to be able to learn solely from bilingual text (bitext), and shows that their models learn to disentangle semantics and syntax in their latent representations, but still suffer from semantic drift.
BiSECT: Learning to Split and Rephrase Sentences with Bitexts
TLDR
A novel dataset and model for the ‘split and rephrase’ task; the dataset contains higher-quality training examples than previous Split and Rephrase corpora, and models trained on BiSECT perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.
Automatically Paraphrasing via Sentence Reconstruction and Round-trip Translation
TLDR
A novel framework for paraphrase generation that simultaneously decodes the output sentence using a pretrained wordset-to-sequence model and a round-trip translation model is proposed and used to augment the training data for machine translation to achieve substantial improvements.
Controllable Paraphrasing and Translation with a Syntactic Exemplar
TLDR
The authors' single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions, and analysis shows that their models learn to disentangle semantics and syntax in their latent representations.
Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing
TLDR
This work proposes the use of a sequence-to-sequence paraphraser for automatic machine translation evaluation, and finds that the model conditioned on the source instead of the reference outperforms every system from the WMT19 ‘quality estimation as a metric’ shared task by a statistically significant margin in every language pair.

References

Showing 1–10 of 35 references
ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
TLDR
This work uses ParaNMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs, to train paraphrastic sentence embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition, in addition to showing how it can be used for paraphrase generation.
Sentential Paraphrasing as Black-Box Machine Translation
TLDR
This work uses the Paraphrase Database for monolingual sentence rewriting and provides machine translation language packs: prepackaged, tuned models that can be downloaded and used to generate paraphrases on a standard Unix environment.
Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods
TLDR
A comprehensive and application-independent survey of data-driven phrasal and sentential paraphrase generation methods is conducted, while also conveying an appreciation for the importance and potential use of paraphrases in the field of NLP research.
PPDB: The Paraphrase Database
TLDR
The 1.0 release of the paraphrase database, PPDB, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations.
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
TLDR
This work presents an algorithm for lexically constrained decoding with a complexity of O(1) in the number of constraints, demonstrates the algorithm’s remarkable ability to properly place constraints, and uses it to explore the shaky relationship between model and BLEU scores.
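The dynamic beam allocation idea behind this reference can be sketched in a few lines: the beam is divided into "banks" according to how many lexical constraints each hypothesis has already satisfied, so partially constrained hypotheses survive pruning even when plain likelihood would discard them. This is a simplified toy, not Post and Vilar's actual implementation, and the candidate tuples and scores are invented:

```python
def allocate_beam(candidates, beam_size, num_constraints):
    """candidates: list of (log_prob, constraints_met, hypothesis) tuples.
    Returns at most beam_size candidates, reserving slots per bank so each
    level of constraint satisfaction keeps a representative."""
    banks = {c: [] for c in range(num_constraints + 1)}
    for cand in sorted(candidates, key=lambda x: -x[0]):
        banks[cand[1]].append(cand)
    per_bank = max(1, beam_size // (num_constraints + 1))
    kept = []
    for c in range(num_constraints, -1, -1):  # most-satisfied banks first
        kept.extend(banks[c][:per_bank])
    # fill any leftover slots with the globally best remaining candidates
    if len(kept) < beam_size:
        rest = [cand for cand in sorted(candidates, key=lambda x: -x[0])
                if cand not in kept]
        kept.extend(rest[:beam_size - len(kept)])
    return kept[:beam_size]

# With a plain top-k beam of size 2, the only hypothesis that has met its
# constraint (log-prob -2.0) would be pruned; bank allocation keeps it.
candidates = [(-0.1, 0, "a"), (-0.2, 0, "b"), (-0.3, 0, "c"), (-2.0, 1, "d")]
kept = allocate_beam(candidates, beam_size=2, num_constraints=1)
```

The design point is that constraint progress, not just score, determines survival — which is what lets constrained decoding place required (or, in ParaBank's case, forbidden) tokens without exploding the beam.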
Extracting Paraphrases from a Parallel Corpus
TLDR
This work presents an unsupervised learning algorithm for identifying paraphrases from a corpus of multiple English translations of the same source text, which yields phrasal and single-word lexical paraphrases as well as syntactic paraphrases.
Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences
TLDR
A syntax-based algorithm that automatically builds Finite State Automata (word lattices) from semantically equivalent translation sets that are good representations of paraphrases and can predict the correctness of alternative semantic renderings, which may be used to evaluate the quality of translations.
Sockeye: A Toolkit for Neural Machine Translation
TLDR
This paper highlights Sockeye's features and benchmark it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): English-German and Latvian-English, and reports competitive BLEU scores across all three architectures.
Efficient Elicitation of Annotations for Human Evaluation of Machine Translation
TLDR
The experimental results show that TrueSkill outperforms other recently proposed models on accuracy, and also can significantly reduce the number of pairwise annotations that need to be collected by sampling non-uniformly from the space of system competitions.
PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification
TLDR
PPDB 2.0 includes a discriminatively re-ranked set of paraphrases that achieve a higher correlation with human judgments than PPDB 1.0's heuristic rankings.