Extracting Lexically Divergent Paraphrases from Twitter


We present MULTIP (Multi-instance Learning Paraphrase Model), a new model suited to identifying paraphrases within short messages on Twitter. We jointly model paraphrase relations between word pairs and sentence pairs, and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve performance competitive with a state-of-the-art method that combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from, yet complement, those found by previous methods; combining our model with previous work significantly outperforms the state-of-the-art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community.
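
As a rough illustration of how sentence-level labels can relate to word-level decisions in a multi-instance setup, the sketch below treats every cross-sentence word pair as one instance and assumes an at-least-one aggregation: a sentence pair counts as a paraphrase if any word pair is judged a paraphrase. The anchor lexicon, the function names, and the threshold are hypothetical stand-ins for illustration only; this is not the authors' implementation, and the abstract does not specify the features or the inference procedure.

```python
from itertools import product

# Hypothetical hand-written lexicon of word-level paraphrase anchors.
# A real model would learn word-pair scores; this is only a stand-in.
ANCHORS = {("amazing", "awesome")}


def word_pairs(sent_a, sent_b):
    """Return all cross-sentence word pairs; each pair is one instance in the bag."""
    return list(product(sent_a.lower().split(), sent_b.lower().split()))


def toy_word_pair_score(w1, w2):
    """Toy scorer (assumption): 1.0 if the pair is in the lexicon in either order."""
    return 1.0 if (w1, w2) in ANCHORS or (w2, w1) in ANCHORS else 0.0


def sentence_is_paraphrase(sent_a, sent_b, threshold=0.5):
    """Assumed at-least-one aggregation over the word-pair instances."""
    return any(
        toy_word_pair_score(w1, w2) > threshold
        for w1, w2 in word_pairs(sent_a, sent_b)
    )


if __name__ == "__main__":
    print(sentence_is_paraphrase("that goal was amazing", "what an awesome goal"))
    print(sentence_is_paraphrase("that goal was amazing", "it is raining today"))
```

Only the sentence-level outcome is observed in this setup; the word-pair decisions play the role of latent variables that training would have to infer.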

Cite this paper

@article{Xu2014ExtractingLD,
  title={Extracting Lexically Divergent Paraphrases from Twitter},
  author={Wei Xu and Alan Ritter and Chris Callison-Burch and William B. Dolan and Yangfeng Ji},
  journal={TACL},
  year={2014},
  volume={2},
  pages={435-448}
}