Deliverable D5.1: Report on the State of the Art of Named Entity and WSD
This paper describes how a preexisting Constraint Grammar based parser for Danish (DanGram, Bick 2002) has been adapted and semantically enhanced to accommodate named entity recognition (NER), using rule-based and lexical rather than probabilistic methodology. The project is part of a multilingual Nordic initiative, Nomen Nescio, which targets 6 primary name types (human, organisation, place, event, title/semantic product and brand/object). Training data, examples and statistical text data specifics were taken from the Korpus90/2000 annotation initiative (Bick 2003-1). The NER task is addressed following the progressive multi-level parsing architecture of DanGram, delegating different NER subtasks to different specialised levels. Named entities are thus treated successively, first as strings, words and types, and then as contextual units at the morphological, syntactic and semantic levels. While the lower levels mainly use pattern matching tools, the higher levels make increasing use of context-based Constraint Grammar rules on the one hand, and of lexical information, both morphological and semantic, on the other. The levels are implemented as a sequential chain of Perl programs and CG grammars. Two evaluation runs on Korpus90/2000 data showed about 2% chunking errors and false positive/false negative proper noun readings (originating at the lower levels), while the NER typer as such had a 5% error rate with 0.1-0.5% remaining ambiguity, if measured only for correctly chunked proper nouns.
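To make the level-by-level architecture more concrete, the following is a minimal sketch of a sequential chain in which a pattern-matching level chunks name candidates, a lexical level assigns candidate types, and a context level removes ambiguity. It is written in Python for illustration only, not in Perl or CG notation, and everything in it is an assumption: the type labels, the toy lexicon, the single context rule and all function names are hypothetical and are not taken from DanGram.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

# Short labels for the six primary name types; the labels themselves are illustrative.
NAME_TYPES: Set[str] = {"hum", "org", "place", "event", "title", "brand"}

# Hypothetical toy lexicon mapping a name's first word to its possible types.
LEXICON: Dict[str, Set[str]] = {"Anna": {"hum"}, "Lund": {"hum", "place"}}


@dataclass
class Token:
    text: str
    is_name: bool = False
    types: Set[str] = field(default_factory=set)


Level = Callable[[List[Token]], List[Token]]


def chunk_names(tokens: List[Token]) -> List[Token]:
    """Lower level (pattern matching): fuse adjacent capitalised words into one name."""
    out: List[Token] = []
    for tok in tokens:
        capitalised = re.match(r"[A-ZÆØÅ]", tok.text) is not None
        if capitalised and out and out[-1].is_name:
            out[-1] = Token(out[-1].text + " " + tok.text, is_name=True)
        else:
            out.append(Token(tok.text, is_name=capitalised))
    return out


def lexicon_lookup(tokens: List[Token]) -> List[Token]:
    """Lexical level: assign candidate types; unknown names stay fully ambiguous."""
    for tok in tokens:
        if tok.is_name:
            first_word = tok.text.split()[0]
            tok.types = set(LEXICON.get(first_word, NAME_TYPES))
    return tokens


def context_rules(tokens: List[Token]) -> List[Token]:
    """Context level: one CG-style selection rule (select 'place' after 'i')."""
    for prev, tok in zip(tokens, tokens[1:]):
        if tok.is_name and prev.text == "i" and "place" in tok.types:
            tok.types = {"place"}
    return tokens


# Levels run strictly in sequence, each consuming the previous level's output.
PIPELINE: List[Level] = [chunk_names, lexicon_lookup, context_rules]


def run_pipeline(words: List[str]) -> List[Token]:
    tokens = [Token(w) for w in words]
    for level in PIPELINE:
        tokens = level(tokens)
    return tokens


def tag(words: List[str]) -> Dict[str, List[str]]:
    """Return each recognised name with its remaining candidate types, sorted."""
    return {t.text: sorted(t.types) for t in run_pipeline(words) if t.is_name}
```

Calling `tag(["Anna", "Hansen", "arbejder", "i", "Lund"])` chunks "Anna Hansen" into one unit typed as human, while the preceding preposition resolves the lexically ambiguous "Lund" to a place. Without such context, "Lund" keeps both candidate types, which corresponds to the kind of remaining ambiguity reported in the evaluation.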