Corpus ID: 232478434

Mining Wikidata for Name Resources for African Languages

@article{Saleva2021MiningWF,
  title={Mining Wikidata for Name Resources for African Languages},
  author={Jonne Saleva and Constantine Lignos},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.00558}
}
This work supports further development of language technology for the languages of Africa by providing a Wikidata-derived resource of name lists corresponding to common entity types (person, location, and organization). While we are not the first to mine Wikidata for name lists, our approach emphasizes scalability and replicability and addresses data quality issues for languages that do not use Latin scripts. We produce lists containing approximately 1.9 million names across 28 African… 
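The mining approach described above can be sketched with a SPARQL query against the public Wikidata endpoint. This is a minimal illustration, not the authors' actual pipeline: the entity-class QIDs shown are standard Wikidata identifiers, but the helper function and the choice of `Q486972` as a representative location class are assumptions for the sketch.

```python
# Sketch: building SPARQL queries to pull per-language name lists from
# Wikidata for common entity types (person, location, organization).
# The QIDs are real Wikidata classes; the function itself is illustrative.

ENTITY_CLASSES = {
    "person": "Q5",           # instance of: human
    "location": "Q486972",    # human settlement (one of several location classes)
    "organization": "Q43229", # organization
}

def build_name_query(entity_type: str, lang: str, limit: int = 100) -> str:
    """Return a SPARQL query selecting labels in `lang` for one entity type."""
    qid = ENTITY_CLASSES[entity_type]
    return f"""
    SELECT ?item ?label WHERE {{
      ?item wdt:P31/wdt:P279* wd:{qid} ;
            rdfs:label ?label .
      FILTER(LANG(?label) = "{lang}")
    }}
    LIMIT {limit}
    """

# Example: person names with Swahili ("sw") labels.
query = build_name_query("person", "sw")
```

The resulting query string can be POSTed to https://query.wikidata.org/sparql; filtering by label language is what yields per-language lists rather than a single multilingual dump.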


KinyaBERT: a Morphology-aware Kinyarwanda Language Model

A simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality is proposed, naming the proposed model architecture KinyaBERT.

References


LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages

This paper describes the textual linguistic resources in nearly 3 dozen languages being produced by Linguistic Data Consortium for DARPA’s LORELEI (Low Resource Languages for Emergent Incidents) Program, and treats the full set of language packs as a coherent whole.

Government Domain Named Entity Recognition for South African Languages

The development efforts focused on creating annotation protocols and data sets with at least 15,000 annotated named-entity tokens for ten of the official South African languages; an overview of the problems encountered during annotation of the data sets is also provided.

Transliterating From All Languages

These analyses are particularly valuable for building machine translation systems for low resource languages, where creating and integrating a transliteration module for a language with few NLP resources may provide substantial gains in translation performance.

Creating a Translation Matrix of the Bible's Names Across 591 Languages

A novel resource of 1129 aligned Bible person and place names across 591 languages is developed and released using several approaches, including weighted edit distance, machine-translation-based transliteration models, and affixal induction and transformation models. The results show the particular efficacy of this approach on the impactful task of broadly multilingual named-entity alignment and translation.

Effective Architectures for Low Resource Multilingual Named Entity Transliteration

It is found that using a Transformer for both the encoder and decoder performs best, improving accuracy by over 4 points compared to previous work, and that the Transformer encoder is better able to handle insertions and substitutions when transliterating.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Soft Gazetteers for Low-Resource Named Entity Recognition

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8118–8123, Online, 2020.