The choice of natural language technology appropriate for a given language is greatly impacted by ‘density’ (availability of digitally stored material). More than half of the world speaks medium…
In this work we study the dynamical features of editorial wars in Wikipedia (WP). Based on our previously established algorithm, we build up samples of controversial and peaceful articles and analyze…
The paper provides an overview of the open source Hungarian language resources that the SzóSzablya ‘WordSword’ project is creating. An extensive crawl of the .hu domain yielded a raw dataset of over…
We present a language-independent optical character recognition (OCR) system that is capable, in principle, of recognizing printed text from most of the world’s languages. For each new language or…
We present a new, efficient method for automatically detecting severe conflicts, ‘edit wars’, in Wikipedia and evaluate this method on six different language Wikipedias. We discuss how the number of…
Common tasks involving orthographic words include spellchecking, stemming, morphological analysis, and morphological synthesis. To enable significant reuse of the language-specific resources across…
In this paper we present a statistical analysis of English texts from Wikipedia. We try to address the issue of language complexity empirically by comparing the Simple English Wikipedia (Simple) to…
Of the approximately 7,000 languages spoken today, some 2,500 are generally considered endangered. Here we argue that this consensus figure vastly underestimates the danger of digital language death,…
The paper presents an evaluation of maxent POS disambiguation systems that incorporate an open source morphological analyzer to constrain the probabilistic models. The experiments show that the best…