Normalizing tweets with edit scripts and recurrent neural embeddings

Abstract

Tweets often contain a large proportion of abbreviations, alternative spellings, novel words, and other non-canonical language. These features are problematic for standard language analysis tools, and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a simple recurrent network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on the state of the art with little training data and without any lexical resources.
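To illustrate the embedding component named in the abstract, the sketch below implements a character-level simple recurrent (Elman) network whose hidden-state activations serve as fixed-size text embeddings. This is a minimal NumPy illustration, not the paper's implementation: the sigmoid activation, one-hot character inputs, hyperparameters, and the use of the final hidden state as the embedding are all assumptions for exposition, and training (in the paper, on unlabeled data) is omitted.

```python
# Minimal sketch of a character-level simple recurrent (Elman) network
# used to produce text embeddings. The activation, sizes, and the choice
# of the final hidden state as the embedding are illustrative assumptions.
import numpy as np

class CharSRN:
    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        self.W_xh = rng.normal(0, scale, (hidden_size, vocab_size))   # input -> hidden
        self.W_hh = rng.normal(0, scale, (hidden_size, hidden_size))  # recurrent hidden -> hidden
        self.b_h = np.zeros(hidden_size)
        self.hidden_size = hidden_size

    def embed(self, char_ids):
        """Run the recurrence over a character sequence and return the
        final hidden state as a fixed-size embedding of the text."""
        h = np.zeros(self.hidden_size)
        for c in char_ids:
            x = np.zeros(self.W_xh.shape[1])
            x[c] = 1.0  # one-hot encoding of the current character
            h = 1.0 / (1.0 + np.exp(-(self.W_xh @ x + self.W_hh @ h + self.b_h)))
        return h

# Usage: embed a noisy tweet fragment at the character level.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
srn = CharSRN(vocab_size=len(vocab), hidden_size=16)
ids = [vocab[ch] for ch in "c u l8r" if ch in vocab]
print(srn.embed(ids).shape)  # (16,)
```

In practice such a network would first be trained on unlabeled tweets (e.g., as a character-level language model) so that its hidden activations encode useful distributional features before being fed into the supervised edit-operation model.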
