Subword Segmental Language Modelling for Nguni Languages

@article{Meyer2022SubwordSL,
  title={Subword Segmental Language Modelling for Nguni Languages},
  author={Francois Meyer and Jan Buys},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.06525}
}
Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying… 
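
Segmental language models of this kind typically score a character sequence by marginalising over its possible segmentations with a dynamic programme. The following is a minimal Python sketch of that marginalisation, not the authors' implementation: seg_log_prob, max_seg_len, and the toy scorer are hypothetical stand-ins for the neural segment model an SSLM would actually learn.

import math

def marginal_log_prob(chars, seg_log_prob, max_seg_len=5):
    """Log-marginal over all segmentations of `chars` via the standard
    segmental-LM dynamic programme:
        alpha[t] = logsumexp_j( alpha[j] + seg_log_prob(chars[:j], chars[j:t]) )
    seg_log_prob(history, segment) stands in for the model's
    history-conditioned log-probability of generating one subword segment."""
    n = len(chars)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0  # empty prefix has log-probability 0
    for t in range(1, n + 1):
        scores = []
        for j in range(max(0, t - max_seg_len), t):
            if alpha[j] == -math.inf:
                continue
            scores.append(alpha[j] + seg_log_prob(chars[:j], chars[j:t]))
        if scores:
            m = max(scores)
            alpha[t] = m + math.log(sum(math.exp(s - m) for s in scores))
    return alpha[n]

def toy_scorer(history, segment):
    # Stand-in for a neural segment model: a fixed cost per character.
    return -1.5 * len(segment)

print(marginal_log_prob("ngiyabonga", toy_scorer))

In a trained SSLM the per-segment score would come from a neural language model conditioned on the generated history, and training would backpropagate through this same dynamic programme.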

References

Showing 1-10 of 37 references

A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

A Masked Segmental Language Model is introduced for joint language modeling and unsupervised segmentation; built on a span-masking transformer architecture, it harnesses a masked bidirectional modeling context and attention while adding the potential for model scalability.

Canonical and Surface Morphological Segmentation for Nguni Languages

This paper investigates supervised and unsupervised models for two variants of morphological segmentation: it trains sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not equal the surface form of the word, and Conditional Random Fields (CRFs) for surface segmentation.

Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling

This is the first work to propose a neural model for unsupervised Chinese word segmentation (CWS), achieving performance competitive with state-of-the-art statistical models on four different datasets from the SIGHAN 2005 bakeoff.

KinyaBERT: a Morphology-aware Kinyarwanda Language Model

A simple yet effective two-tier BERT architecture, named KinyaBERT, is proposed; it leverages a morphological analyzer and explicitly represents morphological compositionality.

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

A simple regularization method, subword regularization, is presented, which trains the model with multiple subword segmentations probabilistically sampled during training, and a new subword segmentation algorithm based on a unigram language model is proposed.
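
For illustration, this kind of sampling can be reproduced with the SentencePiece library, which draws a different segmentation of the same text on each call; the model file name and the example sentence below are hypothetical, and a unigram SentencePiece model is assumed to have been trained beforehand.

import sentencepiece as spm

# Assumes an already-trained unigram SentencePiece model (file name is hypothetical).
sp = spm.SentencePieceProcessor(model_file="unigram.model")

text = "ngiyabonga kakhulu"
for _ in range(3):
    # enable_sampling draws a segmentation from the unigram LM's lattice;
    # alpha smooths the sampling distribution, nbest_size=-1 samples over all candidates.
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))

Each iteration prints a different subword segmentation of the same sentence, which is exactly the training-time variability the regularizer exploits.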

A Systematic Study of Leveraging Subword Information for Learning Word Representations

This work proposes a general framework for learning subword-informed word representations that allows for easy experimentation with different segmentation and composition components, including more advanced techniques based on position embeddings and self-attention.

Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

It is shown that competitive multilingual language models can be trained on less than 1 GB of text, and the results suggest that this “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages.

Unsupervised Word Segmentation with Bi-directional Neural Language Model

Experimental results show that the context-sensitive unsupervised segmentation model achieves state-of-the-art performance under different evaluation settings on various Chinese datasets, and comparable results for Thai.

Multi-view Subword Regularization

To take full advantage of different possible input segmentations, the proposed Multi-view Subword Regularization (MVR) method enforces consistency between predictions made on inputs tokenized by the standard segmentation and by probabilistically sampled segmentations.
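
As a sketch of what such a consistency term could look like, the snippet below computes a symmetrised KL divergence between a model's output distributions for the deterministic and the sampled tokenization of the same input; the symmetric-KL form and the random logits are illustrative assumptions, not necessarily the exact regularizer used in the paper.

import torch
import torch.nn.functional as F

def consistency_loss(logits_standard, logits_sampled):
    # Symmetrised KL divergence between predictions on the deterministically
    # tokenized input and on a probabilistically sampled segmentation.
    log_p = F.log_softmax(logits_standard, dim=-1)
    log_q = F.log_softmax(logits_sampled, dim=-1)
    kl_pq = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Toy usage: random logits stand in for two forward passes over the two views.
logits_a = torch.randn(4, 10)
logits_b = torch.randn(4, 10)
print(consistency_loss(logits_a, logits_b))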

Neural Machine Translation of Rare Words with Subword Units

This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT '15 English-German and English-Russian translation tasks by up to 1.3 BLEU.
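
As a rough illustration of the BPE procedure this paper popularised, the toy learner below repeatedly merges the most frequent adjacent symbol pair; the corpus counts and merge budget are made up, and a real implementation (for example the authors' subword-nmt tool) adds vocabulary handling and a separate application step for new text.

import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Toy BPE learner: repeatedly merge the most frequent adjacent symbol pair.
    word_freqs maps space-separated symbol sequences (one per word) to corpus counts."""
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace the winning pair with its concatenation in every word.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges

# Hypothetical corpus counts; '</w>' marks the end of a word.
freqs = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
print(learn_bpe(freqs, 10))

The learned merge operations are then applied greedily to segment any word, including rare or unseen ones, into known subword units.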