Automatic normalization of short texts by combining statistical and rule-based techniques

Marta R. Costa-Jussà and Rafael E. Banchs
Language Resources and Evaluation
Short texts are typically composed of a small number of words, most of which are abbreviations, typos and other kinds of noise. This makes the noise-to-signal ratio relatively high for this category of text. A high proportion of noise in the data is undesirable for analysis procedures as well as for machine learning applications. Text normalization techniques are used to reduce the noise and improve the quality of the text for processing and analysis. In this work, we propose a…

Extracted Numerical Results

  • In this first normalization step we obtained a BLEU of 76.03 over the development set and 72.11 over the test set.
  • This is mainly because the SMT step is not deterministic and will behave differently depending on whether the input is the original raw text or the cleaned text resulting from applying the rules.
  • Table fragment (raw text): perplexity 110.7, vocabulary 18.9k.
  • All in all, results show that the best configuration is the one that uses both the statistical and the rule-based correction methods, in that order, in which case only a 6.23% reduction in perplexity is obtained.
  • In this last case a 7.05% reduction in perplexity is achieved.
  • In this case a 6.23% reduction in perplexity is obtained.
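
The percentage reductions quoted above are relative changes in perplexity. As a minimal sketch of that arithmetic, the snippet below converts between a baseline perplexity, a reduced perplexity, and the percentage reduction. It assumes that the raw-text perplexity of 110.7 appearing among the extracted values is the baseline for both reductions; the excerpt does not confirm this, and the function names are hypothetical, not taken from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the percentage reduction of `new` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


def perplexity_after(baseline: float, reduction_pct: float) -> float:
    """Return the perplexity implied by a percentage reduction from `baseline`."""
    return baseline * (1.0 - reduction_pct / 100.0)


# ASSUMPTION: 110.7 (raw text) is the baseline for both quoted reductions.
RAW_PERPLEXITY = 110.7

if __name__ == "__main__":
    for pct in (6.23, 7.05):
        implied = perplexity_after(RAW_PERPLEXITY, pct)
        print(f"{pct:.2f}% reduction -> implied perplexity {implied:.2f}")
```

Under that assumption, the two quoted reductions correspond to implied perplexities of roughly 103.8 and 102.9, so the gap between the two configurations is less than one perplexity point.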