The paper "Data structure in Slovak National Corpus and its external annotation" presents a proposed system for data storage and annotation in the Slovak National Corpus. Because the data should remain readable in the future, the system targets a platform-independent, easily decipherable annotation format. The data in …
… mathematical symbols and others): each character forms a separate token. For example, the sentence „Win98 mi nefunguje!!!“ ("Win98 doesn't work for me!!!") yields 8 tokens (opening low quotation mark, Win98, mi, nefunguje, exclamation mark, exclamation mark, exclamation mark, closing high quotation mark). Segmentation is an important stage in automatic text processing, because morphological analysis and disambiguation depend directly on its results.
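The segmentation rule in this abstract can be illustrated with a minimal sketch. It is not the corpus's actual tokenizer: the function name `segment` and the regular expression are assumptions made for illustration, and the typographic quotation marks „ “ are assumed from the abstract's description of low and high quotes. A run of letters and digits forms one token, and every other non-whitespace character forms a token of its own.

```python
import re

# Hypothetical rule: a run of word characters (letters, digits, underscore)
# is one token; any other non-whitespace character is a token by itself.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def segment(text: str) -> list[str]:
    """Split text into tokens according to the rule sketched above."""
    return TOKEN_RE.findall(text)


tokens = segment("„Win98 mi nefunguje!!!“")
print(len(tokens), tokens)
```

For the example sentence this gives the eight tokens listed in the abstract, with each exclamation mark and each quotation mark counted separately.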
Morphological annotation is an essential, useful and very common kind of linguistic information in corpora, especially for highly inflectional languages. The morphological tagset used in the Slovak National Corpus has been designed with several goals in mind: the tags are compact and easily human-readable, without sacrificing their …
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact at the regional, national and international levels, mainly with regard to politics and the funding situation for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress and …
Manual morphological text annotation is indisputably an important part of building a framework of NLP tools used in corpus construction. From 2004 to 2005, the complete text of Orwell's novel 1984, some Slovak Wikipedia texts and some newspaper articles were annotated. In the paper we present the methodology used in manual annotation and correction of …
The French-Slovak parallel corpus FRASK is a sizeable corpus consisting of European Union legislative texts and fiction in both French and Slovak. The texts are sentence-aligned, lemmatized and contain morphological information. The search mechanism makes it possible to query single words, phrases, lemmas and morphological tags, using …
The article briefly reviews the bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus has been collected and developed as a result of collaboration within the framework of the joint research project be- … Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural …
The article defines rules for automatic text segmentation and for semi-automatic lemmatization and POS tagging of the texts in the Slovak National Corpus. The basic unit at the level of text segmentation is the token, which is traditionally defined as a sequence of alphanumeric characters delimited by whitespace. On the level of morphological …