A system is described that organizes vast document collections according to textual similarities using the self-organizing map (SOM) algorithm; documents are represented by 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
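The random-projection step can be sketched as follows: a high-dimensional weighted word histogram is multiplied by a fixed random matrix to obtain a low-dimensional document vector. This is a minimal illustration, not the WEBSOM implementation; the vocabulary size is an assumption, and the 500-dimensional target follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50_000   # original histogram dimensionality (illustrative assumption)
target_dim = 500      # reduced dimensionality, as stated in the text

# Fixed random projection matrix, shared by all documents.
R = rng.standard_normal((target_dim, vocab_size)) / np.sqrt(target_dim)

def project(histogram: np.ndarray) -> np.ndarray:
    """Map a weighted word histogram to a 500-dimensional vector."""
    return R @ histogram

# Example: a sparse histogram with a few weighted word counts.
h = np.zeros(vocab_size)
h[[10, 512, 40_000]] = [3.0, 1.5, 0.5]
doc_vec = project(h)
assert doc_vec.shape == (500,)
```

Because the projection is a fixed linear map, distances between document vectors approximately preserve distances between the original histograms, which is what makes the subsequent SOM computation feasible.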

The first public version of the Morfessor software is described: a program that takes as input a corpus of unannotated text and produces a segmentation of the word forms observed in the text.

Two methods for unsupervised segmentation of words into morpheme-like units are presented, one based on the Minimum Description Length (MDL) principle and the other on Maximum Likelihood (ML) optimization.
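The MDL idea above can be sketched with a toy cost function: the total description length of a proposed segmentation is the bits needed to encode the morph lexicon plus the bits needed to encode the corpus as a sequence of morph tokens under their maximum-likelihood probabilities. This is a minimal sketch under simplifying assumptions (fixed bits per lexicon character), not the actual Morfessor cost.

```python
import math
from collections import Counter

def mdl_cost(segmented_corpus: list[list[str]], bits_per_char: float = 5.0) -> float:
    """Toy MDL cost: lexicon cost + corpus coding cost, in bits."""
    tokens = [m for word in segmented_corpus for m in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Corpus cost: -log2 likelihood of the morph token sequence.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())
    # Lexicon cost: encode each distinct morph's characters.
    lexicon_cost = sum(len(m) * bits_per_char for m in counts)
    return corpus_cost + lexicon_cost

# Reusing the morph "talo" across Finnish word forms lowers the cost
# relative to storing each whole word in the lexicon.
split = mdl_cost([["talo", "ssa"], ["talo", "on"], ["talo", "ssa"]])
whole = mdl_cost([["talossa"], ["taloon"], ["talossa"]])
assert split < whole
```

The search procedure then compares candidate segmentations by this cost and keeps splits that pay for themselves through lexicon reuse.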

An algorithm is presented for the unsupervised learning, or induction, of a simple morphology of a natural language; it builds hierarchical representations for a set of morphs, which are morpheme-like units discovered from unannotated text corpora.

Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes and is shown to perform very well compared to a widely known benchmark algorithm on Finnish data.

Special consideration is given to the computation of very large document maps, which is feasible on general-purpose computers if the dimensionality of the word category histograms is first reduced with a random mapping method and computationally efficient algorithms are used to compute the SOMs.

Two measures for comparing how different maps represent relations between data items are developed, one of which combines an index of discontinuities in the mapping from the input data set to the map grid with a measure of the accuracy with which the map represents the data set.

This work contains an overview of the WEBSOM method and its performance; as a special application, the WEBSOM map of the texts of Encyclopaedia Britannica is described.

Morfessor Baseline is extended, and it is shown that known linguistic segmentations can be exploited by adding them to the data likelihood function and optimizing separate weights for unlabeled and labeled data.
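The weighting scheme described above can be sketched as a combined objective: the log-likelihoods of the unlabeled corpus and of the known (labeled) segmentations each receive their own weight. The function and weight names are illustrative assumptions, not the Morfessor API.

```python
def weighted_cost(unlabeled_logl: float, labeled_logl: float,
                  alpha: float, beta: float) -> float:
    """Negative weighted log-likelihood of both data sources; lower is better."""
    return -(alpha * unlabeled_logl + beta * labeled_logl)

# With a larger labeled-data weight beta, a model that fits the known
# segmentations better (higher labeled_logl) wins the comparison even
# if it fits the unlabeled corpus slightly worse.
model_a = weighted_cost(unlabeled_logl=-1000.0, labeled_logl=-50.0, alpha=1.0, beta=2.0)
model_b = weighted_cost(unlabeled_logl=-990.0, labeled_logl=-80.0, alpha=1.0, beta=2.0)
assert model_a < model_b
```

Tuning the two weights on held-out data then controls how strongly the labeled segmentations steer the otherwise unsupervised model.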