• Corpus ID: 233004275

Low-Resource Language Modelling of South African Languages

@article{Mesham2021LowResourceLM,
  title={Low-Resource Language Modelling of South African Languages},
  author={Stuart Mesham and Luc Hayward and Jared Shapiro and Jan Buys},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.00772}
}
Language models are the foundation of current neural network-based models for natural language understanding and generation. However, research on the intrinsic performance of language models on African languages has been extremely limited, which is made more challenging by the lack of large or standardised training and evaluation sets that exist for English and other high-resource languages. In this paper, we evaluate the performance of open-vocabulary language models on low-resource South…

Subword Segmental Language Modelling for Nguni Languages

A subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling, enabling the model to discover morpheme-like subwords that improve its LM capabilities.

Bootstrapping NLP tools across low-resourced African languages: an overview and prospects

Bootstrapping grammars for geographically distant languages has been shown to still have positive outcomes for morphology and for rule- or grammar-based natural language generation, both of which are fertile ground for further research.

Improving N-Best Rescoring in Under-Resourced Code-Switched Speech Recognition Using Pretraining and Data Augmentation

It is concluded that the careful optimisation of the pretraining strategy used for neural network language models can offer worthwhile improvements in speech recognition accuracy even at language switches, and that much larger state-of-the-art architectures such as GPT-2 and M-BERT promise even further gains.

References

SHOWING 1-10 OF 34 REFERENCES

Developing Text Resources for Ten South African Languages

The process and challenges of simultaneously developing multiple linguistic resources for ten of the official languages of South Africa are described, and the quality of these tools for each language is reported.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Neural Machine Translation of Rare Words with Subword Units

This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.

An Analysis of Neural Language Modeling at Multiple Scales

This work takes existing state-of-the-art word-level language models based on LSTMs and QRNNs and extends them to both larger vocabularies and character-level granularity, achieving state-of-the-art results on character-level and word-level datasets.

Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model

It is shown how the spellings of known words can help us deal with unknown words in open-vocabulary NLP tasks and beat previous work and establish state-of-the-art results on multiple datasets.

Direct Output Connection for a High-Rank Language Model

This paper proposes a state-of-the-art recurrent neural network (RNN) language model that combines probability distributions computed not only from a final RNN layer but also middle layers, and indicates the proposed method contributes to application tasks: machine translation and headline generation.

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.

A Neural Probabilistic Language Model

This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

The dynamic memory network (DMN), a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers, is introduced.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.