Heiki-Jaan Kaalep

The EU Copernicus project Multext-East has created a multilingual corpus of text and speech data covering the six languages of the project. Lexicons for each of the languages were also developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to ...
The paper describes a morphological analyser for Estonian and how the use of a text corpus influenced both the process of creating it and the resulting program. The influence is not limited to the lexicon; it is also noticeable in the resulting algorithm and implementation. When work on the analyser started, there was no computational treatment of ...
This paper describes the automatic treatment of multi-word expressions in a morphologically complex flective language, Estonian. It focuses on a special type of multi-word expression: verbal multi-word expressions that can function as predicates. The authors describe two language resources: a database of verbal multi-word expressions and a corpus where ...
This article introduces a corpus-based method for improving automatic morphological analysis of a non-standard text variety, specifically Estonian chatroom texts. First, a morphological analyzer designed for standard written Estonian is used to analyse the chatroom texts. ...
This work introduces a method and tool for handling overlapping parallel corpora, i.e. corpora that are based on the same source material. The method is insensitive to minor changes in the text, to different segmentation levels of the corpora, and to material omitted from either corpus. The aim is to detect matching sentence pairs and either produce combinations ...