Publications
Common Voice: A Massively-Multilingual Speech Corpus
This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages. For most of these languages, these are the first published results on end-to-end Automatic Speech Recognition.
Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation…
Apertium: a free/open-source platform for rule-based machine translation
The Apertium platform is summarised: the translation engine, the encoding of linguistic data, and the tools developed around the platform are discussed.
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
The paper defines the task and evaluation methodology, describes how the data sets were prepared, reports and analyzes the main results, and provides a brief categorization of the different approaches of the participating systems.
Universal Dependencies 2.1
The annotation scheme is based on (universal) Stanford dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets.
Universal Dependencies for Turkish
The findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.
Universal Dependencies 2.0 – CoNLL 2017 Shared Task Development and Test Data
This release contains the test data used in the CoNLL 2017 shared task on parsing Universal Dependencies, and complements the UD 2.0 release with 18 new parallel test sets and 4 test sets in surprise languages.
Open morphology of Finnish
Omorfi is a free and open source project containing various tools and data for handling Finnish texts in a linguistically motivated manner, and a collection of scripts to convert the lexical database into formats used by upstream NLP tools.
Finite-state morphological transducers for three Kypchak languages
The paper describes how developing a transducer for each subsequent closely related language took less time, and shows that the transducers all have reasonable coverage on freely available corpora of the languages and high precision over a manually verified test set.
Extracting bilingual word pairs from Wikipedia
A bilingual dictionary or word list is an important resource for many purposes, among them machine translation. For many language pairs these are either non-existent, or very often unavailable owing…