The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages
We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature, with an average size of nearly 9 million words per language.
MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development.
Towards a Slovene Dependency Treebank
The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2,000 sentences or 30,000 words.
Universal Dependencies 2.1
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-language learning, and parsing research from a language typology perspective.
MULTEXT-East: morphosyntactic resources for Central and Eastern European languages
  • T. Erjavec
  • Computer Science
  • Lang. Resour. Evaluation
  • 1 March 2012
The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description.
Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene
In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia.
Designing and Evaluating a Russian Tagset
This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation.
Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data
This release complements the UD 2.0 release (http://hdl.handle.net/11234/1-1983), extending it to a full release of UD treebanks.
Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets
The paper evaluates tagging techniques on a corpus of Slovene, where we are faced with a large number of possible word-class tags and only a small (hand-tagged) dataset.
Datasets of Slovene and Croatian Moderated News Comments
This paper presents two large newly constructed datasets of moderated news comments from two highly popular online news portals in the respective countries.