The need for lemmatization in inflectionally rich languages is indisputable: it is applicable to the whole range of procedures, from text search to parsing. Of the two predominant approaches to lemmatization, 1) algorithmic (generally rule-based and realized with FSA) and 2) relational (generally data-driven and realized with databases), this paper…
This paper describes the results of the first successful effort to apply a stochastic strategy, namely a second-order Markov model paradigm implemented by the TnT trigram tagger, to morphosyntactic tagging of Croatian texts. Besides the tagger, for purposes of both training and testing, we had at our disposal only a 100 Kw Croatia Weekly newspaper…
This paper presents experiments for enlarging the Croatian Morphological Lexicon by applying an automatic acquisition methodology. The basic sources of information for the system are a set of morphological rules and a raw corpus. The morphological rules have been automatically derived from the existing Croatian Morphological Lexicon and we have used in our…
The paper presents the work in progress on building the Croatian Dependency Treebank. Its design principles, procedures and the pilot corpus used within are described. Perspectives for further development of the Croatian Dependency Treebank are presented at the end.
The paper presents results of an experiment dealing with sentiment analysis of Croatian text from the domain of finance. The goal of the experiment was to design a system model for automatic detection of general sentiment and polarity phrases in these texts. We have assembled a document collection from web sources writing on the financial market in Croatia…
We present the current state of development of the Croatian Dependency Treebank – with special emphasis on adapting the Prague Dependency Treebank formalism to Croatian language specifics – and illustrate its possible applications in an experiment with dependency parsing using MaltParser. The treebank currently contains approximately 2870 sentences, out of…
The contribution gives a survey of procedures and formats used in building the Croatian-English parallel corpus, which is being collected at the Institute of Linguistics at the Philosophical Faculty, University of Zagreb. The primary text source is the newspaper Croatia Weekly, which has been published since the beginning of 1998 by HIKZ (Croatian Institute for…