Canonical and Surface Morphological Segmentation for Nguni Languages

Tumi Moeng, Sheldon Reay, Aaron B. Daniels, Jan Buys
Morphological segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically rich agglutinative languages such as those of the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may…

Subword Segmental Language Modelling for Nguni Languages

A subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling, enabling the model to discover morpheme-like subwords that improve its LM capabilities.

Research on the Uyghur morphological segmentation model with an attention mechanism

An improved labelling scheme that joins morphological boundary labels and voice harmony labels for the two kinds of segmentation simultaneously is proposed, and the experimental results show that the F1 values of canonical segmentation and surface segmentation achieve the best results.

Weakly Supervised Word Segmentation for Computational Language Documentation

The experiments on two very low resource languages (Mboshi and Japhug), whose documentation is still in progress, show that weak supervision can be beneficial to the segmentation quality and open the way for interactive annotation tools for documentary linguists.

A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

A Masked Segmental Language Model is introduced for joint language modeling and unsupervised segmentation, built on a span-masking transformer architecture, harnessing a masked bidirectional modeling context and attention, as well as adding the potential for model scalability.

Constructing a Derivational Morphology Resource with Transformer Morpheme Segmentation

This paper describes a framework for the creation of new derivational morphology databases for a selected set of productive affixes in English. The sample resource obtained comprises almost 120k

Developing Core Technologies for Resource-Scarce Nguni Languages

The curation and annotation of corpora and the development of multiple linguistic technologies for four official South African languages, namely isiNdebele, Siswati, isiXhosa, and isiZulu are described.

Geographical Distance Is The New Hyperparameter: A Case Study Of Finding The Optimal Pre-trained Language For English-isiZulu Machine Translation.

The results indicate the value of transfer learning from closely related languages to enhance the performance of low-resource translation models, thus providing a key strategy for low-resource translation going forward.

Identifying Relation Between Miriek and Kenyah Badeng Language by Using Morphological Analyzer

Miriek and Kenyah-Badeng are native minority languages in Sarawak with a dwindling number of speakers and are spoken particularly in the state's northern region [4]. Miriek and Kenyah-Badeng are

Morphological Processing of Low-Resource Languages: Where We Are and What’s Next

It is argued that the field is ready to tackle the logical next challenge: understanding a language’s morphology from raw text alone, and the stakes are high: solving this task will increase the language coverage of morphological resources by a number of magnitudes.



Supervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields

It is shown that the fully supervised boundary prediction approach outperforms the state-of-the-art semi-supervised morph lexicon approaches on all languages when using the same annotated data sets.

Unsupervised models for morpheme segmentation and morphology learning

Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes and is shown to perform very well compared to a widely known benchmark algorithm on Finnish data.

Cross-lingual Word Segmentation and Morpheme Segmentation as Sequence Labelling

This paper presents the segmentation system developed for the MLP 2017 shared tasks on cross-lingual word segmentation and morpheme segmentation as character-level sequence labelling tasks and achieves outstanding accuracies when compared to the other participating systems.

Morphological Segmentation with Window LSTM Neural Networks

Novel neural network architectures that learn the structure of input sequences directly from raw input words and are subsequently able to predict morphological boundaries are proposed.

Labeled Morphological Segmentation with Semi-Markov Models

A new hierarchy of morphotactic tagsets and CHIPMUNK, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics are introduced.

Experimental Fast-Tracking of Morphological Analysers for Nguni Languages

Tests show that the high degree of shared typological properties and formal similarities among the Nguni varieties warrants a modular fast-tracking approach, and the focus lies on providing adaptations based on failure output analysis for each language.

Neural Morphological Analysis: Encoding-Decoding Canonical Segments

A character-based neural encoder-decoder model for canonical morphological segmentation is proposed and extended to include morpheme-level and lexical information through a neural reranker.

Ukwabelana - An open-source morphological Zulu corpus

The agglutinating morphology of Zulu, with its multiple prefixation and suffixation, is described, the labeling scheme is introduced, and a new open-source morphological corpus for Zulu, the Ukwabelana corpus, is presented.

Neural Sequence-to-sequence Learning of Internal Word Structure

This paper presents a neural encoder-decoder model that combines character-level sequence-to-sequence transformation with a language model over canonical segments for learning canonical morphological segmentation and shows that including corpus counts is beneficial to both approaches.

A Joint Model of Orthography and Morphological Segmentation

A model of morphological segmentation that jointly learns to segment and restore orthographic changes, e.g., funniest ↦ fun-y-est, is presented and an importance sampling algorithm for approximate inference is derived.
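
The difference between surface and canonical segmentation that runs through the entries above can be illustrated with a minimal Python sketch. It is not taken from any of the listed papers. The function names and the B/I label scheme are hypothetical, and the surface split "funn-i-est" is an assumption chosen for illustration; only the canonical form "fun-y-est" comes from the example above. The sketch checks the one property that separates the two variants: surface segments concatenate back to the original word, while canonical segments need not, because orthographic changes have been restored.

```python
def is_surface_segmentation(word, segments):
    """Surface segments must concatenate back to the original word."""
    return "".join(segments) == word


def boundary_labels(segments):
    """Per-character labels for a surface segmentation:
    'B' marks the first character of a morpheme, 'I' any later character.
    (Hypothetical label scheme, used here only for illustration.)"""
    labels = []
    for seg in segments:
        if seg:  # skip empty segments so they produce no labels
            labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels


word = "funniest"
surface = ["funn", "i", "est"]   # assumed surface split (illustrative)
canonical = ["fun", "y", "est"]  # canonical form from the example above

print(is_surface_segmentation(word, surface))    # True
print(is_surface_segmentation(word, canonical))  # False
print("".join(boundary_labels(surface)))         # BIIIBBII
```

Because a surface segmentation aligns one-to-one with the characters of the word, it can be encoded as a label per character, which is why it is often treated as sequence labelling. A canonical segmentation has no such alignment, which is one reason it is commonly modelled as sequence-to-sequence transduction instead.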