Parallel corpora for medium density languages
A general methodology is described for rapidly collecting, building, and aligning parallel corpora for medium density languages, with the main points illustrated on Hungarian, Romanian, and Slovenian.
HunPos: an open source trigram tagger
HunPos is presented as a free and open source (LGPL-licensed) alternative that can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models while preserving the ease and speed of training and tagging.
Dynamics of Conflicts in Wikipedia
This work builds samples of controversial and peaceful articles, analyzes the temporal characteristics of activity in these samples, and identifies three distinct developmental patterns in the overall behavior of the articles.
Digital Language Death
It is argued that the consensus figure for endangered languages vastly underestimates the danger of digital language death, in that less than 5% of all languages can still ascend to the digital realm.
Creating Open Language Resources for Hungarian
An extensive crawl of the .hu domain yielded a raw dataset of over 18 million web pages, and the methods used to detect and remove duplicates and low-quality, foreign, and mixed-language documents are discussed.
Edit Wars in Wikipedia
A new, efficient method for automatically detecting severe conflicts ("edit wars") in Wikipedia is presented and evaluated on six different language Wikipedias.
How many words are there?
The commonsensical assumption that any language has only finitely many words is shown to be false by a combination of formal and empirical arguments. Zipf's Law and related formulas are investigated.
Extended finite state models of language
A workshop was held in August 1996 in Budapest to bring together those developing and using extended finite state methods for text analysis, speech/OCR language modeling, and related CL and NLP tasks with those in AI and CS interested in analyzing and possibly extending the domain of finite state algorithms.
Robust language-independent OCR system
A language-independent optical character recognition system is described that is capable, in principle, of recognizing printed text from most of the world's languages, using hidden Markov modeling technology to model each character.
Hunmorph: Open Source Word Analysis
The functionality of the open source spellchecker MySpell is extended, yielding a generic word analysis library that serves as the runtime layer of the hunmorph toolkit. An offline resource management component, hunlex, complements the efficiency of this runtime layer with a high-level description language and a configurable precompiler.