Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data


Part-of-speech information is a prerequisite for many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, which motivates a series of tagger augmentations that are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre. Further, we present a novel approach to system combination for the case where the available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Combined with assigning prior probabilities to some tokens and handling unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.
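The idea of vote-constrained bootstrapping across taggers with different tagsets can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy tag mapping `COARSE_TO_PTB`, and the rule of keeping a tweet only when every token's tags are compatible are assumptions made for illustration. The sketch tags unlabeled sentences with a PTB tagger and a second tagger that uses a coarser tagset, and keeps a sentence as extra training data only when the two outputs agree under the mapping.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Tagger = Callable[[Sequence[str]], List[str]]

# Hypothetical mapping from a coarse tagset to the PTB tags it is
# compatible with. A real mapping would cover the full tagsets.
COARSE_TO_PTB: Dict[str, Set[str]] = {
    "N": {"NN", "NNS"},
    "V": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"},
    "A": {"JJ", "JJR", "JJS"},
    "D": {"DT"},
}


def votes_agree(
    ptb_tags: Sequence[str],
    coarse_tags: Sequence[str],
    mapping: Dict[str, Set[str]],
) -> bool:
    """Return True if every PTB tag is compatible with the coarse tag."""
    if len(ptb_tags) != len(coarse_tags):
        return False
    return all(p in mapping.get(c, set()) for p, c in zip(ptb_tags, coarse_tags))


def select_bootstrap_data(
    sentences: Sequence[Sequence[str]],
    ptb_tagger: Tagger,
    coarse_tagger: Tagger,
    mapping: Dict[str, Set[str]] = COARSE_TO_PTB,
) -> List[List[Tuple[str, str]]]:
    """Keep unlabeled sentences on which both taggers cast compatible votes.

    The returned sentences carry the PTB tagger's labels and could be
    added to the training set before retraining (assumed workflow).
    """
    selected = []
    for tokens in sentences:
        ptb_tags = ptb_tagger(tokens)
        coarse_tags = coarse_tagger(tokens)
        if votes_agree(ptb_tags, coarse_tags, mapping):
            selected.append(list(zip(tokens, ptb_tags)))
    return selected
```

Any callable that maps a token list to a tag list can be passed as a tagger here. Requiring agreement on every token is a conservative choice: it trades the amount of bootstrapped data for cleaner labels.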


