Extracting Lexically Divergent Paraphrases from Twitter

Abstract

We present MULTIP (Multi-instance Learning Paraphrase Model), a new model suited to identifying paraphrases within the short messages on Twitter. We jointly model paraphrase relations between word and sentence pairs and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve performance competitive with a state-of-the-art method that combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from yet complement previous methods; combining our model with previous work significantly outperforms the state of the art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community.
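
The multi-instance setup described above, with labels on sentence pairs only and word pairs treated as unlabeled instances, can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the abstract does not say how word-pair decisions are aggregated, so the sketch uses the common multi-instance "at least one positive instance" rule, and `pair_score` is a hypothetical stand-in for a learned word-pair model, not the MULTIP model itself.

```python
from itertools import product


def word_pairs(s1, s2):
    """All cross-sentence word pairs: the unlabeled instances in the bag."""
    return list(product(s1.lower().split(), s2.lower().split()))


def pair_score(w1, w2):
    """Hypothetical word-pair scorer (assumption): 1.0 for identical words,
    0.0 otherwise. A real model would learn this from features."""
    return 1.0 if w1 == w2 else 0.0


def sentence_is_paraphrase(s1, s2, threshold=0.5):
    """Assumed at-least-one aggregation: the sentence pair is labeled
    positive if any word pair scores above the threshold."""
    return any(pair_score(a, b) > threshold for a, b in word_pairs(s1, s2))
```

Only the sentence-level output would be compared against annotations during training; the word-pair decisions stay latent. Note that an exact-match scorer like this one cannot capture the lexically divergent paraphrases the abstract targets, which is why a learned word-pair model is needed in practice.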

