MasakhaNER: Named Entity Recognition for African Languages
@article{Adelani2021MasakhaNERNE,
  title={MasakhaNER: Named Entity Recognition for African Languages},
  author={David Ifeoluwa Adelani and Jade Z. Abbott and Graham Neubig and Daniel D'souza and Julia Kreutzer and Constantine Lignos and Chester Palen-Michel and Happy Buzaaba and Shruti Rijhwani and Sebastian Ruder and Stephen Mayhew and Israel Abebe Azime and Shamsuddeen Hassan Muhammad and Chris C. Emezue and Joyce Nakatumba-Nabende and Perez Ogayo and Anuoluwapo Aremu and Catherine Gitau and Derguene Mbaye and Jesujoba Oluwadara Alabi and Seid Muhie Yimam and Tajuddeen Rabiu Gwadabe and Ignatius U. Ezeani and Andre Niyongabo Rubungo and Jonathan Mukiibi and Verrah Otiende and Iroro Orife and Davis David and Samba Ngom and Tosin P. Adewumi and Paul Rayson and Mofetoluwa Adeyemi and Gerald Muriuki and Emmanuel Anebi and Chiamaka Ijeoma Chukwuneke and Nkiruka Bridget Odu and Eric Peter Wairagala and Samuel Oyerinde and Clemencia Siro and Tobius Saul Bateesa and Temilola Oloyede and Yvonne Wambui and Victor Akinode and Deborah Nabagereka and Maurice Katusiime and Ayodele Awokoya and Mouhamadane Mboup and Dibora Gebreyohannes and Henok Tilaye and Kelechi Nwaike and Degaga Wolde and Abdoulaye N Faye and Blessing K. Sibanda and Orevaoghene Ahia and Bonaventure F. P. Dossou and Kelechi Ogueji and Thierno Ibrahima Diop and Abdoulaye Diallo and Adewale Akinfaderin and Tendai Munyaradzi Marengereke and Salomey Osei},
  journal={Transactions of the Association for Computational Linguistics},
  year={2021},
  volume={9},
  pages={1116--1131}
}
Abstract
We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation…
79 Citations
Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages
- Computer Science, ArXiv
- 2022
The adapted PLM is competitive with applying LAFT on individual languages while requiring significantly less disk space, and it improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.
AfriNames: Most ASR models "butcher" African Names
- Computer Science
- 2023
It is demonstrated that model bias can be mitigated through multilingual pre-training, intelligent data augmentation strategies to increase the representation of African-named entities, and fine-tuning multilingual ASR models on multiple African accents.
Free Lunch: Robust Cross-Lingual Transfer via Model Checkpoint Averaging
- Computer Science
- 2023
A simple and effective method that averages different checkpoints (i.e., model snapshots) during task fine-tuning, which substantially desensitizes cross-lingual transfer (XLT) to varying hyperparameter choices in the absence of target-language validation data.
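The averaging step at the heart of this method is straightforward to sketch; below is a minimal, hypothetical PyTorch version (file names and the per-epoch snapshot setup are illustrative, not taken from the paper):

```python
# Minimal sketch of checkpoint averaging, assuming PyTorch state dicts
# saved at several points during task fine-tuning.
import torch

def average_checkpoints(paths):
    """Element-wise average of parameter tensors across checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage (illustrative file names): average the per-epoch snapshots,
# then load the result before zero-shot cross-lingual evaluation.
# model.load_state_dict(average_checkpoints(["ep1.pt", "ep2.pt", "ep3.pt"]))
```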
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
- Computer Science
- 2023
This work introduces a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples and instructions, and it suggests various avenues for future research in few-shot cross-lingual transfer, such as improved pretraining, understanding, and evaluation.
GlobalBench: A Benchmark for Global Progress in Natural Language Processing
- Computer Science
- 2023
This work introduces GlobalBench, an ever-expanding collection that aims to dynamically track progress on all NLP datasets in all languages. It also tracks the estimated per-speaker utility and equity of technology across languages, providing a multi-faceted view of how language technology serves people around the world.
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages
- Computer Science, Linguistics
- 2023
Evaluating on the MasakhaPOS dataset, it is shown that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods.
LLM-powered Data Augmentation for Enhanced Crosslingual Performance
- Computer Science
- 2023
Human evaluation reveals that LLMs like ChatGPT and GPT-4 excel at generating natural text in most languages, except for a few such as Tamil; ChatGPT trails behind in generating plausible alternatives compared to the original dataset, while GPT-4 demonstrates competitive logical consistency in the synthesized data.
Transfer-Free Data-Efficient Multilingual Slot Labeling
- Computer Science
- 2023
This work examines challenging scenarios where such transfer-enabling English annotated data cannot be guaranteed, and proposes a two-stage slot labeling approach (termed TWOSL) which transforms standard multilingual sentence encoders into effective slot labelers.
Better Low-Resource Entity Recognition Through Translation and Annotation Fusion
- Computer Science
- 2023
This work presents TransFusion, a model trained to fuse predictions from a high-resource language to make robust predictions on low-resource languages; it is robust to translation errors and source-language prediction errors, and complementary to adapted multilingual language models.
73 References
Neural Architectures for Named Entity Recognition
- Computer Science, NAACL
- 2016
Paper presented at the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, held in San Diego (CA, USA), June 12–17, 2016.
Unsupervised Cross-lingual Representation Learning at Scale
- Computer Science, ACL
- 2020
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, demonstrating for the first time the possibility of multilingual modeling without sacrificing per-language performance.
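This is the XLM-R paper; models of this family are the usual starting point for NER on datasets like MasakhaNER. A minimal sketch of loading it for token classification with the Hugging Face Transformers API (the label count is a placeholder for a BIO-style tag set):

```python
# Sketch: XLM-R with a token-classification head, e.g. for NER fine-tuning.
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=9,  # placeholder: B-/I- tags for PER, ORG, LOC, DATE plus O
)

batch = tokenizer("Adelani et al. introduce MasakhaNER.", return_tensors="pt")
logits = model(**batch).logits  # shape: (1, sequence_length, num_labels)
```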
JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages
- Computer Science, Linguistics, ACL
- 2019
JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average, is introduced, and its utility is showcased in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection.
Cross-lingual Name Tagging and Linking for 282 Languages
- Computer Science, Linguistics, ACL
- 2017
This work develops a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia that is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable.
Bidirectional LSTM-CRF Models for Sequence Tagging
- Computer Science, ArXiv
- 2015
This work is the first to apply a bidirectional LSTM-CRF model to NLP benchmark sequence tagging data sets, and it is shown that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a bidirectional LSTM component.
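As a concrete illustration of the architecture, here is a compact PyTorch sketch, assuming the third-party pytorch-crf package for the CRF layer (dimensions and class structure are illustrative, not from the paper):

```python
# BiLSTM-CRF tagger sketch; requires `pip install pytorch-crf`.
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden_dim, num_tags)  # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, words, tags):
        emissions = self.proj(self.lstm(self.embed(words))[0])
        return -self.crf(emissions, tags)  # negative log-likelihood

    def predict(self, words):
        emissions = self.proj(self.lstm(self.embed(words))[0])
        return self.crf.decode(emissions)  # Viterbi-decoded tag sequences
```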
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
- Computer Science, ICML
- 2001
This work presents iterative parameter estimation algorithms for conditional random fields and compares the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
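For reference, the linear-chain CRF used for sequence labeling defines the conditional probability of a tag sequence y given an input x as (standard notation, not quoted from the paper):

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)
```

where the f_k are feature functions over adjacent tags and the input, the λ_k are learned weights, and Z(x) normalizes over all candidate tag sequences.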
HuggingFace's Transformers: State-of-the-art Natural Language Processing
- Computer Science, ArXiv
- 2019
The Transformers library is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
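For NER specifically, the library exposes a one-line inference path; a minimal usage sketch follows (with no model argument, the pipeline falls back to a default English NER checkpoint, so a MasakhaNER-fine-tuned model would be substituted for African languages):

```python
# Sketch: off-the-shelf NER via the Transformers pipeline API.
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Sebastian Ruder works at Google in London."))
# -> entity spans with labels such as PER, ORG, LOC plus confidence scores
```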
RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Computer Science, ArXiv
- 2019
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Parallel Data, Tools and Interfaces in OPUS
- Computer Science, LREC
- 2012
This work reports on new data sets and their features, additional annotation tools and models provided on the website, and the essential interfaces and on-line services included in the OPUS project.
GloVe: Global Vectors for Word Representation
- Computer Science, EMNLP
- 2014
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
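The global log-bilinear objective referred to here is the standard GloVe loss (standard notation):

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

where X_ij counts co-occurrences of words i and j, w and w̃ are word and context vectors with biases b and b̃, and f is a weighting function that caps the influence of very frequent pairs.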