Matti Varjokallio

We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating…
We analyze subword-based language models (LMs) in large-vocabulary continuous speech recognition across four “morphologically rich” languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. By estimating n-gram LMs over sequences of morphs instead of words, better vocabulary coverage and reduced data sparsity are obtained. Standard word LMs…
This paper presents the evaluation of Morpho Challenge Competition 1 (linguistic gold standard). Competition 2 (information retrieval) is described in a companion paper. In Morpho Challenge 2007, the objective was to design statistical machine learning algorithms that discover which morphemes (smallest individually meaningful units of language) words…
The objective of the challenge for the unsupervised segmentation of words into morphemes, or the Morpho Challenge for short, was to design a statistical machine learning algorithm that segments words into morphemes, the smallest meaning-bearing units of language. Ideally, these are basic vocabulary units suitable for different tasks, such as speech and text…
In Morpho Challenge 2007, the objective was to design statistical machine learning algorithms that discover which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical…
String segmentation is an important and recurring problem in natural language processing and other domains. For morphologically rich languages, the number of distinct word forms produced by morphological processes such as agglutination, compounding, and inflection may be huge, which causes problems for the traditional word-based language modeling approach. Segmenting…
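To make the segmentation idea concrete, here is a minimal sketch of one common formulation: a Viterbi search for the most probable split of a word into units from a fixed lexicon under a unigram model. It is not taken from the paper and is not the Morfessor algorithm. The function name `segment`, the `max_len` parameter, and the toy lexicon with its probabilities are all assumptions made up for illustration.

```python
import math


def segment(word, morph_logprob, max_len=10):
    """Return the highest-scoring split of `word` into lexicon units, or None.

    `morph_logprob` maps each known unit to its log-probability under an
    assumed unigram model; `max_len` caps the length of a single unit.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last unit in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in morph_logprob and best[start] > -math.inf:
                score = best[start] + morph_logprob[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None  # the word cannot be built from lexicon units
    # Follow the back-pointers from the end of the word to recover the split.
    morphs = []
    end = n
    while end > 0:
        start = back[end]
        morphs.append(word[start:end])
        end = start
    return morphs[::-1]


# Hypothetical toy lexicon; the probabilities are invented and not normalized.
toy_probs = {"talo": 0.2, "auto": 0.2, "ssa": 0.2, "kin": 0.1, "ta": 0.05, "lo": 0.05}
toy_lexicon = {unit: math.log(p) for unit, p in toy_probs.items()}

print(segment("talossa", toy_lexicon))     # ['talo', 'ssa']
print(segment("autossakin", toy_lexicon))  # ['auto', 'ssa', 'kin']
print(segment("koira", toy_lexicon))       # None
```

In this toy setup, a word form such as "autossakin" can be covered by three lexicon units even if the full form was never observed, which is the kind of vocabulary-coverage benefit that subword modeling aims for.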
The goal of Morpho Challenge 2008 was to find and evaluate unsupervised algorithms that provide morpheme analyses for words in different languages. Especially in morphologically complex languages, such as Finnish, Turkish and Arabic, morpheme analysis is important for lexical modeling of words in speech recognition, information retrieval and machine…