Self organization of a massive document collection
TLDR
A system that organizes vast document collections according to textual similarities using the self-organizing map (SOM) algorithm, with documents represented as 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0
TLDR
The first public version of the Morfessor software is described, which is a program that takes as input a corpus of unannotated text and produces a segmentation of the word forms observed in the text.
Unsupervised Discovery of Morphemes
TLDR
Two methods for unsupervised segmentation of words into morpheme-like units are presented: one based on the Minimum Description Length (MDL) principle and one using Maximum Likelihood (ML) optimization.
Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
TLDR
An algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language. It builds hierarchical representations for a set of morphs, the morpheme-like units discovered from unannotated text corpora.
Unsupervised models for morpheme segmentation and morphology learning
TLDR
Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes and is shown to perform very well compared to a widely known benchmark algorithm on Finnish data.
WEBSOM - Self-organizing maps of document collections
TLDR
Special consideration is given to the computation of very large document maps, which is possible with general-purpose computers if the dimensionality of the word category histograms is first reduced with a random mapping method and if computationally efficient algorithms are used in computing the SOMs.
Comparing Self-Organizing Maps
TLDR
Two measures for comparing how different maps represent relations between data items are developed, one of which combines an index of discontinuities in the mapping from the input data set to the map grid with a measure of the accuracy with which the map represents the data set.
Mining massive document collections by the WEBSOM method
TLDR
This work contains an overview of the WEBSOM method and its performance, and as a special application, the WEBSOM map of the texts of Encyclopaedia Britannica is described.
Semi-Supervised Learning of Concatenative Morphology
TLDR
Morfessor Baseline is extended and it is shown that known linguistic segmentations can be exploited by adding them into the data likelihood function and optimizing separate weights for unlabeled and labeled data.