• Corpus ID: 6063298

High-Performance, Language-Independent Morphological Segmentation

Sajib Dasgupta and Vincent Ng
This paper introduces an unsupervised morphological segmentation algorithm that shows robust performance for four languages with different levels of morphological complexity. In particular, our algorithm outperforms Goldsmith's Linguistica and Creutz and Lagus's Morfessor for English and Bengali, and achieves performance that is comparable to the best results for all three PASCAL evaluation datasets. Improvements arise from (1) the use of relative corpus frequency and suffix level similarity… 

Unsupervised morphological parsing of Bengali
This paper introduces a simple, yet highly effective algorithm for unsupervised morphological learning for Bengali, an Indo–Aryan language that is highly inflectional in nature.
Unsupervised Acquiring of Morphological Paradigms from Tokenized Text
Although quite simple, this approach surprisingly outperformed several others in most morpheme segmentation subcompetitions, and there is enough room for improvements that could raise the results even higher.
The Study of Effect of Length in Morphological Segmentation of Agglutinative Languages
A simple unsupervised model for morphological segmentation is presented, and it is shown that knowledge of morph length has a positive impact and provides competitive results in terms of overall performance.
Morpheme Segmentation for Highly Agglutinative Tamil Language by Means of Unsupervised Learning
The importance of unsupervised morphological segmentation algorithms for the problem of morpheme boundary detection in the Tamil language, which is highly inflectional and agglutinative in morphology, is illustrated.
Unsupervised Morphology Learning with Statistical Paradigms
An unsupervised model for morphological segmentation that exploits the notion of paradigms, which are sets of morphological categories that can be applied to a homogeneous set of words, and chooses reliable suffixes from them to improve segmentation accuracy.
Unsupervised morphological segmentation and clustering with document boundaries
A simple approach to unsupervised morphology acquisition is presented that uses no thresholds other than those involved in standard application of χ² significance testing, using document boundaries to constrain generation of candidate stems and affixes and clustering morphological variants of a given word stem.
Morpheme Segmentation for Kannada Standing on the Shoulder of Giants
S. Bhat, Linguistics, 2012
This paper studies the applicability of a set of state-of-the-art unsupervised morphological segmentation algorithms for the problem of morpheme boundary detection in Kannada, a resource-poor
Automatic Morpheme Segmentation and Labeling in Universal Dependencies Resources
The model allows us to provide a more detailed morphosyntactic labeling and segmentation of the UD data and allows for automatic discovery, segmentation, and labeling of allomorphs in the data sets.
Unsupervised learning of agglutinated morphology using nested Pitman-Yor process based morpheme induction algorithm
A method to morphologically segment highly agglutinating and inflectional languages from the Dravidian family, using the nested Pitman-Yor process and a corpus-based morpheme induction algorithm to perform morpheme segmentation.
Allomorfessor: Towards Unsupervised Morpheme Analysis
This work extends the unsupervised morpheme segmentation method Morfessor Baseline to account for the linguistic phenomenon of allomorphy, and shows that a small model change gives state-of-the-art results.


Unsupervised Word Segmentation for Bangla
A simple, yet highly effective algorithm for unsupervised word segmentation for Bangla, an Indo-Aryan language that is highly inflectional in nature, achieves an F-score of 84%, substantially outperforming Linguistica, one of the most widely used unsupervised morphological analyzers.
Unsupervised Morphological Segmentation Based on Segment Predictability and Word Segments Alignment
An unsupervised method for the segmentation of words into sub-units devised for this objective relies on segment predictability to discover a set of prefixes and suffixes and performs word segments alignment to detect morpheme boundaries.
Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency
We present a language-independent and unsupervised algorithm for the segmentation of words into morphs. The algorithm is based on a new generative probabilistic model, which makes use of relevant
A Bayesian Model for Morpheme and Paradigm Identification
A system for unsupervised learning of morphological affixes from texts or word lists composed of a generative probability model and a search algorithm that can be formalized in terms of the lattice formed by subsets of suffixes under inclusion.
Unsupervised Learning of the Morphology of a Natural Language
This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size
A Simpler, Intuitive Approach to Morpheme Induction
We present a simple, psychologically plausible algorithm to perform unsupervised learning of morphemes. The algorithm is most suited to Indo-European languages with a concatenative morphology, and in
Minimally Supervised Morphological Analysis by Multimodal Alignment
A corpus-based algorithm capable of inducing inflectional morphological analyses of both regular and highly irregular forms from distributional patterns in large monolingual text with no direct supervision is presented.
Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0
The first public version of the Morfessor software is described, which is a program that takes as input a corpus of unannotated text and produces a segmentation of the word forms observed in the text.
Morphology Induction from Term Clusters
This work addresses the problem of learning a morphological automaton directly from a monolingual text corpus without recourse to additional resources by searching for affix transformation rules that express correspondences between term clusters induced from the data.
Knowledge-Free Induction of Inflectional Morphologies
An algorithm to automatically induce the morphology of inflectional languages using only text corpora and no human input is proposed, showing it to be an improvement over any knowledge-free algorithm yet proposed.