The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages
A new, unique and freely available parallel corpus of European Union documents, mostly legal in nature, available in all 20 official EU languages. It is particularly suitable for all types of cross-language research and for testing and benchmarking text analysis software across different languages.
Towards a Slovene Dependency Treebank
The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2000 sentences or 30,000 words. Our approach to annotation is based on the Prague Dependency Treebank,
Universal Dependencies 2.1
The annotation scheme is based on (universal) Stanford dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets.
MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the third release of the MULTEXT-East language resources, which brings together the first two, makes them available in TEI P4 XML, and introduces further extensions, e.g., the specification for Resian, a dialect of Slovene.
MULTEXT-East: morphosyntactic resources for Central and Eastern European languages
  • T. Erjavec
  • Computer Science
    Lang. Resour. Evaluation
  • 1 March 2012
The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description, which is unique in terms of languages covered and the wealth of encoding.
Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene
The legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia are presented and an automatic identification and classification system is trained to contribute towards an improved methodology, understanding and treatment of such practices in the contemporary, increasingly multicultural information society.
Designing and Evaluating a Russian Tagset
The principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation are reported; the tagset achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus.
Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data
This release contains the test data used in the CoNLL 2017 shared task on parsing Universal Dependencies, and complements the UD 2.0 release with 18 new parallel test sets and 4 test sets in surprise languages.
Datasets of Slovene and Croatian Moderated News Comments
This paper presents two large newly constructed datasets of moderated news comments from two highly popular online news portals in the respective countries: the Slovene RTV MCC and the Croatian
hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene
Two new annotated web corpora are introduced, built using a standard "Web as Corpus" pipeline modified with the limited amount of available web data in mind. The modification focuses on content extraction from HTML pages and combines high precision of extracted language content with decent recall.