Corpus ID: 7365958

Unsupervised Multilingual Learning for Morphological Segmentation

@inproceedings{Snyder2008UnsupervisedML,
  title={Unsupervised Multilingual Learning for Morphological Segmentation},
  author={Benjamin Snyder and Regina Barzilay},
  booktitle={ACL},
  year={2008}
}
For centuries, the deep connection between languages has brought about major discoveries about human communication. In this paper we investigate how this powerful source of information can be exploited for unsupervised language learning. In particular, we study the task of morphological segmentation of multiple languages. We present a nonparametric Bayesian model that jointly induces morpheme segmentations of each language under consideration and at the same time identifies cross-lingual… 
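The paper's actual model is a joint multilingual Bayesian one, but the monolingual core idea it builds on, scoring candidate segmentations of a word by code length under a morpheme inventory, can be sketched in a few lines. The toy morpheme counts and the per-character fallback cost below are illustrative assumptions, not the authors' method:

```python
import math

# Toy morpheme inventory with counts (hypothetical data, for illustration only).
counts = {"walk": 50, "jump": 40, "ing": 80, "ed": 60, "s": 90, "un": 30}
total = sum(counts.values())

def cost(m):
    """Code length of a morpheme in bits: -log2 P(m). Unseen morphemes
    pay a crude per-character penalty so every word stays segmentable."""
    if m in counts:
        return -math.log2(counts[m] / total)
    return 8.0 * len(m)  # assumed fallback: 8 bits per character

def segment(word):
    """Dynamic program: best[i] holds the cheapest encoding of word[:i]
    as (total cost, morpheme list)."""
    best = [(0.0, [])] + [(math.inf, [])] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(i):
            c = best[j][0] + cost(word[j:i])
            if c < best[i][0]:
                best[i] = (c, best[j][1] + [word[j:i]])
    return best[len(word)][1]

print(segment("walking"))   # -> ['walk', 'ing']
print(segment("unjumped"))  # -> ['un', 'jump', 'ed']
```

A full model along the paper's lines would additionally place nonparametric priors over the morpheme inventory and couple the segmentations of two languages through aligned morpheme pairs; the sketch above only shows the search over segmentations given fixed morpheme statistics.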

Citations

Unsupervised multilingual learning
TLDR
A class of probabilistic models is presented that exploits deep links among human languages as a form of naturally occurring supervision, allowing substantial performance improvements on core text processing tasks such as morphological segmentation, part-of-speech tagging, and syntactic parsing.
Semi-Supervised Learning of Concatenative Morphology
TLDR
Morfessor Baseline is extended, and it is shown that known linguistic segmentations can be exploited by adding them to the data likelihood function and optimizing separate weights for unlabeled and labeled data.
Unsupervised Morphological Segmentation with Log-Linear Models
TLDR
This paper presents the first log-linear model for unsupervised morphological segmentation, based on monolingual features only, which outperforms a state-of-the-art system by a large margin, even when the latter uses bilingual information such as phrasal alignment and phonetic correspondence.
Climbing the Tower of Babel: Unsupervised Multilingual Learning
TLDR
A class of probabilistic models is presented that uses links among human languages as a form of naturally occurring supervision, allowing substantial performance improvements on core text processing tasks such as morphological segmentation, part-of-speech tagging, and syntactic parsing.
A Trie-Structured Bayesian Model for Unsupervised Morphological Segmentation
TLDR
The results show that using different information sources such as neural word embeddings and letter successor variety as prior information improves morphological segmentation in a Bayesian model.
Modeling Morphological Typology for Unsupervised Learning of Language Morphology
TLDR
A language-independent model for fully unsupervised morphological analysis that exploits a universal framework leveraging morphological typology and investigates the effect of an oracle that provides only a handful of bits per language to signal morphological type.
Building Morphological Chains for Agglutinative Languages
TLDR
The results indicate that candidate generation plays an important role in such an unsupervised log-linear model that is learned using contrastive estimation with negative samples.
Context-dependent type-level models for unsupervised morpho-syntactic induction
TLDR
This thesis improves unsupervised methods for part-of-speech (POS) induction and morphological word segmentation by modeling linguistic phenomena previously not used, exploiting the fact that affixes are correlated within a word and between adjacent words.
Neural sequence-to-sequence models for low-resource morphology
TLDR
This thesis presents approaches to generating morphological inflections or analyzing word forms through canonical segmentation with state-of-the-art deep learning models, and shows that similar languages can improve the performance of morphological generation systems in low-resource settings.
Probabilistic modelling of morphologically rich languages
TLDR
This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language; it formulates a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and develops a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words.
...

References

SHOWING 1-10 OF 30 REFERENCES
Cross-lingual Propagation for Morphological Analysis
TLDR
The proposed non-parametric Bayesian model effectively combines cross-lingual alignment with target-language predictions, and is a potent alternative to projection methods, which decompose these decisions into two separate stages of morphological segmentation.
Unsupervised Learning of the Morphology of a Natural Language
This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size
Unsupervised models for morpheme segmentation and morphology learning
TLDR
Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes and is shown to perform very well compared to a widely known benchmark algorithm on Finnish data.
An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation
TLDR
A text encoding method is proposed for languages with affixational morphology in which knowledge of word-formation rules helps in disambiguation; HMM algorithms are adapted for learning and searching this text representation, so that segmentation and tagging can be learned in parallel in one step.
A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources
TLDR
This work takes a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons, and presents an alternative evaluation metric for this system that shows how much human labor is needed to convert the tagging output into a high-precision annotated resource.
Contextual Dependencies in Unsupervised Word Segmentation
TLDR
Two new Bayesian word segmentation methods are proposed that assume unigram and bigram models of word dependencies, respectively; the bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation.
Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages
TLDR
Experimental results demonstrate that the proposed new bootstrapping approach to unsupervised part-of-speech induction works well for English and Bengali, thus providing suggestive evidence that it is applicable to both morphologically impoverished languages and highly inflectional languages.
A Backoff Model for Bootstrapping Resources for Non-English Languages
TLDR
This paper proposes a novel approach of combining a bootstrapped resource with a small amount of manually annotated data, and shows that this approach achieves a significant improvement over EM, self-training, and systems trained only on manual annotations.
An Unsupervised Method for Word Sense Tagging using Parallel Corpora
TLDR
An unsupervised method for word sense disambiguation that exploits translation correspondences in parallel corpora is presented; pseudo-translations created by machine translation systems make it possible to evaluate the approach against a standard test set.
Unsupervised Learning of Arabic Stemming Using a Parallel Corpus
TLDR
This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer that uses an English stemmer and a small (10K-sentence) parallel corpus as its sole training resources.
...