Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data


Part-of-speech information is a prerequisite for many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, which motivates a series of tagger augmentations that are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre. Further, we present a novel approach to system combination for the case where the available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Combining this with prior probabilities assigned to some tokens and with handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.
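The abstract does not spell out how vote-constrained bootstrapping works, so the following is only a minimal sketch of one plausible reading: two taggers with different tagsets label unlabeled tweets, and a tweet is added to the training pool only when every fine-grained tag is compatible with the coarse tag proposed for the same token. The tag mapping, the toy lexicon taggers, and all function names here are hypothetical and are not taken from the paper or its released tools.

```python
from typing import Callable, Dict, List, Set, Tuple

Tokens = List[str]
Tagger = Callable[[Tokens], List[str]]

# Hypothetical mapping: coarse tag -> PTB tags considered compatible with it.
COARSE_TO_PTB: Dict[str, Set[str]] = {
    "N": {"NN", "NNS"},
    "V": {"VB", "VBD", "VBZ", "VBP", "VBG", "VBN"},
    "D": {"DT"},
    "A": {"JJ"},
}


def votes_agree(ptb_tags: List[str], coarse_tags: List[str],
                mapping: Dict[str, Set[str]]) -> bool:
    """True when every PTB tag falls inside the set allowed by the coarse tag."""
    if len(ptb_tags) != len(coarse_tags):
        return False
    return all(p in mapping.get(c, set()) for p, c in zip(ptb_tags, coarse_tags))


def vote_constrained_bootstrap(unlabeled: List[Tokens], ptb_tagger: Tagger,
                               coarse_tagger: Tagger,
                               mapping: Dict[str, Set[str]]
                               ) -> List[List[Tuple[str, str]]]:
    """Keep only tweets on which both taggers cast compatible votes."""
    accepted = []
    for tokens in unlabeled:
        ptb = ptb_tagger(tokens)
        coarse = coarse_tagger(tokens)
        if votes_agree(ptb, coarse, mapping):
            # The PTB labels become new (automatically labeled) training data.
            accepted.append(list(zip(tokens, ptb)))
    return accepted


# Toy stand-ins for real taggers (assumptions for illustration only).
def toy_ptb_tagger(tokens: Tokens) -> List[str]:
    lexicon = {"the": "DT", "dog": "NN", "barks": "VBZ", "lol": "NN"}
    return [lexicon.get(t.lower(), "NN") for t in tokens]


def toy_coarse_tagger(tokens: Tokens) -> List[str]:
    lexicon = {"the": "D", "dog": "N", "barks": "V", "lol": "!"}
    return [lexicon.get(t.lower(), "N") for t in tokens]


unlabeled_tweets = [["the", "dog", "barks"], ["lol", "the", "dog"]]
new_training = vote_constrained_bootstrap(
    unlabeled_tweets, toy_ptb_tagger, toy_coarse_tagger, COARSE_TO_PTB
)
print(new_training)
```

In this toy run the second tweet is rejected because the coarse tagger labels "lol" with a tag that has no compatible PTB tag in the mapping, so the two votes disagree. A real setup would retrain the PTB-compatible tagger on the accepted tweets; that step is omitted here.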
