• Corpus ID: 14886349

Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0

@inproceedings{Creutz2005UnsupervisedMS,
  title={Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0},
  author={Mathias Creutz and K. Lagus},
  year={2005}
}
In this work, we describe the first public version of the Morfessor software, which is a program that takes as input a corpus of unannotated text and produces a segmentation of the word forms observed in the text. The segmentation obtained often resembles a linguistic morpheme segmentation. Morfessor is not language-dependent. The number of segments per word is not restricted to two or three as in some other existing morphology learning models. The current version of the software essentially… 
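To make the input/output behaviour described in the abstract concrete, here is a minimal Python sketch of what segmenting a word form into morphs looks like. It is not Morfessor's algorithm: Morfessor learns its morph lexicon from the corpus, whereas this sketch assumes a small hand-made lexicon (`toy_counts`) and only searches for the lowest-cost split under a unigram model. The names `segment` and `toy_counts` are hypothetical and chosen for illustration.

```python
import math

def segment(word, morph_counts):
    """Return the lowest-cost split of `word` into morphs found in `morph_counts`.

    Cost of a morph is its negative log unigram probability (assumed model).
    """
    total = sum(morph_counts.values())
    # best[i] = (cost of the best split of word[:i], start index of its last morph)
    best = [(0.0, 0)] + [(math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morph_counts and best[start][0] < math.inf:
                cost = best[start][0] - math.log(morph_counts[piece] / total)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[-1][0] == math.inf:
        return [word]  # no split covers the word; keep it whole
    # Walk the back-pointers from the end of the word to recover the morphs.
    morphs, end = [], len(word)
    while end > 0:
        start = best[end][1]
        morphs.append(word[start:end])
        end = start
    return morphs[::-1]

# Hand-made toy lexicon (an assumption, not learned from data).
toy_counts = {"talo": 5, "ssa": 4, "kin": 3, "auto": 2}
print(segment("talossakin", toy_counts))  # ['talo', 'ssa', 'kin']
```

Note that the number of segments per word is not fixed in advance, which mirrors the property the abstract highlights.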

Unsupervised models for morpheme segmentation and morphology learning
TLDR
Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes and is shown to perform very well compared to a widely known benchmark algorithm on Finnish data.
Unsupervised segmentation of words into morphemes - Challenge 2005, An Introduction and Evaluation Report
TLDR
A statistical machine learning algorithm is designed that segments words into the smallest meaning-bearing units of language, morphemes. Ideally, these are basic vocabulary units suitable for different tasks, such as speech and text understanding, machine translation, information retrieval, and statistical language modeling.
Unsupervised Acquiring of Morphological Paradigms from Tokenized Text
TLDR
Although quite simple, this approach surprisingly outperformed several others in most morpheme segmentation subcompetitions, and there is enough room for improvement to raise the results further.
Induction of the morphology of natural language : unsupervised morpheme segmentation with application to automatic speech recognition
TLDR
The main objective of this thesis is to devise a method that discovers the likely locations of the morpheme boundaries in words of any language by learning a simple model of concatenative morphology (word forming) in an unsupervised manner from plain text.
Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
TLDR
An algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language, which builds hierarchical representations for a set of morphs, which are morpheme-like units discovered from unannotated text corpora.
Morphological Segmentation of Czech Words
  • Jonáš Vidra
  • Computer Science
  • 2018
TLDR
The task of this thesis is to create an automatic method for segmenting Czech words into morphemes, usable within the network of Czech derivational relations DeriNet, and to create a neural network made to jointly predict segmentation and derivational parents.
Statistical and Computational Models for Whole Word Morphology
The purpose of this thesis is to provide an unsupervised machine learning approach for language morphology, in which the latter is modeled as string transformations on whole words, rather than the
Automatic Morpheme Segmentation and Labeling in Universal Dependencies Resources
TLDR
The model allows us to provide a more detailed morphosyntactic labeling and segmentation of the UD data and allows for automatic discovery, segmentation, and labeling of allomorphs in the data sets.
Unsupervised Morpheme Analysis Evaluation by IR experiments - Morpho Challenge 2007
TLDR
The results indicate that the morpheme analysis has a significant effect on IR performance in all tested languages (Finnish, English and German) and can also rival the best language-dependent word normalization methods.
Statistical models for unsupervised learning of morphology and POS tagging
TLDR
The results show that a joint model is possible for learning morphology and POS tagging, in which POS tags and morphology are learned simultaneously, and a model to capture paradigms through syntactic categories is proposed.

References

SHOWING 1-10 OF 16 REFERENCES
Morpheme Segmentation Gold Standards for Finnish and English
This document describes Hutmegs, the Helsinki University of Technology Morphological Evaluation Gold Standard package, which contains gold-standard morphological segmentations for 1.4 million Finnish
Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
TLDR
An algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language, which builds hierarchical representations for a set of morphs, which are morpheme-like units discovered from unannotated text corpora.
Induction of a Simple Morphology for Highly-Inflecting Languages
TLDR
An algorithm for the unsupervised learning of a simple morphology of a natural language from raw text using a generative probabilistic model to segment word forms into morphs, which makes the model suitable for highly-inflecting languages.
Unsupervised Discovery of Morphemes
TLDR
Two methods for unsupervised segmentation of words into morpheme-like units are presented: one is based on the Minimum Description Length (MDL) principle, and the other uses Maximum Likelihood (ML) optimization.
Unsupervised Learning of the Morphology of a Natural Language
This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size
An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery
  • M. Brent
  • Computer Science
    Machine Learning
  • 2004
TLDR
Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that the model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
On lexicon creation for Turkish LVCSR
TLDR
This paper addresses the lexicon design problem in Turkish large vocabulary speech recognition using morphology-based and data-driven methods and presents experimental results that show the methods are very effective to lower the word error rate at the expense of lexicon size.
Morphological Analysis for Statistical Machine Translation
We present a novel morphological analysis technique which induces a morphological and syntactic symmetry between two languages with highly asymmetrical morphological structures to improve statistical
Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency
We present a language-independent and unsupervised algorithm for the segmentation of words into morphs. The algorithm is based on a new generative probabilistic model, which makes use of relevant
A Self-Organizing Japanese Word Segmenter using Heuristic Word Identification and Re-estimation
We present a self-organized method to build a stochastic Japanese word segmenter from a small number of basic words and a large amount of unsegmented training text. It consists of a word-based