Simple Data-Driven Context-Sensitive Lemmatization


Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for machine learning and induce the class labels automatically. We achieve this by computing a Shortest Edit Script (SES) between the reversed input and output strings. An SES describes the transformations that have to be applied to the input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages.
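To make the idea of an SES-derived class label concrete, here is a minimal Python sketch. The abstract does not specify the edit operations or how scripts are encoded, so everything below is an assumption for illustration: the function names (`shortest_edit_script`, `apply_script`, `lemma_class`, `lemmatize`) are hypothetical, edits are restricted to single-character inserts and deletes, positions are counted from the start of the reversed string, and the script is found with a plain longest-common-subsequence table rather than whatever algorithm the paper uses.

```python
def shortest_edit_script(source, target):
    """Minimal insert/delete script turning source into target.

    Each edit is (op, position, char); position indexes into source.
    Assumption: only 'I' (insert before position) and 'D' (delete) are used.
    """
    n, m = len(source), len(target)
    # lcs[i][j] = length of the longest common subsequence of source[i:], target[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == target[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    script, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and j < m and source[i] == target[j]:
            i, j = i + 1, j + 1                    # characters match: keep
        elif i < n and (j == m or lcs[i + 1][j] >= lcs[i][j + 1]):
            script.append(("D", i, source[i]))     # delete source[i]
            i += 1
        else:
            script.append(("I", i, target[j]))     # insert before source[i]
            j += 1
    return script


def apply_script(source, script):
    """Apply a script produced by shortest_edit_script to source."""
    deletes = {pos for op, pos, _ in script if op == "D"}
    inserts = {}
    for op, pos, ch in script:
        if op == "I":
            inserts.setdefault(pos, []).append(ch)
    out = []
    for i in range(len(source) + 1):
        out.extend(inserts.get(i, []))             # inserts land before position i
        if i < len(source) and i not in deletes:
            out.append(source[i])
    return "".join(out)


def lemma_class(form, lemma):
    """Hashable class label: the SES between the reversed form and lemma."""
    return tuple(shortest_edit_script(form[::-1], lemma[::-1]))


def lemmatize(form, label):
    """Recover the lemma by applying a class label to the reversed form."""
    return apply_script(form[::-1], label)[::-1]
```

Under these assumptions, `lemma_class("walked", "walk")` and `lemma_class("started", "start")` yield the same label, because both scripts delete the first two characters of the reversed form. Computed on the unreversed strings, the delete positions would depend on stem length and the two pairs would get different labels, which is one plausible reason for reversing the strings before computing the script.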
