Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets

Seid Muhie Yimam, Abinew Ali Ayele, Gopalakrishnan Venkatesh, and Christian Biemann
The availability of different pre-trained semantic models has enabled the quick development of machine learning components for downstream applications. However, even where texts are abundant for low-resource languages, very few semantic models are publicly available. Most publicly available pre-trained models are built as multilingual versions of semantic models that do not fit the needs of low-resource languages well. We introduce different semantic models for Amharic…

Learned Text Representation for Amharic Information Retrieval and Natural Language Processing

Experimental results show that word-based query expansion and language modeling perform better than stem-based and root-based text representations, and that fastText outperforms other word embeddings on the word-based corpus.
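fastText's edge on a word-based corpus plausibly comes from its subword modeling: each word vector is the sum of vectors for the word's character n-grams, which helps with morphologically rich languages like Amharic. A minimal sketch of the n-gram extraction step (the function name and defaults are illustrative, not the library's implementation; the `<`/`>` boundary markers follow the fastText paper's convention):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract fastText-style character n-grams: the word is wrapped
    in boundary markers < and > before n-grams are taken."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    # the whole (wrapped) word is also kept as its own feature
    if wrapped not in grams:
        grams.append(wrapped)
    return grams

print(char_ngrams("ኢትዮጵያ", n_min=3, n_max=4))
```

Because the n-grams, not just the full word, carry weight, inflected forms that share a stem also share most of their representation.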

Romanization-based Large-scale Adaptation of Multilingual Language Models

The results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups: on languages with unseen scripts and with limited training data without any vocabulary augmentation.

Developing Amharic Question Answering Model Over Unstructured Data Source Using Deep Learning Approach

  • Abenezer Mengistu Elema
  • Computer Science
    2022 International Conference on Information and Communication Technology for Development for Africa (ICT4DA)
  • 2022
With a reasonably large Amharic QA dataset compared to those used in previous studies, and without handcrafted rules, linguistic tools, or frameworks, this study's end-to-end deep neural network models outperform previous Amharic QA systems.

GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages

This work presents a language identification dataset for five low-resourced East African languages that use the Ge’ez script as a writing system, integrates it into an existing language-identification tool, and fine-tunes several Transformer-based language models, achieving strong results in all cases.

Question Answering Classification for Amharic Social Media Community Based Questions

This work builds a Question Answering (QA) classification dataset from a social media platform, namely the Telegram public channel @AskAnythingEthiopia, and develops deep learning-based question answering classifiers that attain an F-score as high as 57.29 across 20 question classes or categories.

Challenges of Amharic Hate Speech Data Annotation Using Yandex Toloka Crowdsourcing Platform

The main challenges of crowdsourcing annotation for Amharic hate speech data collection using Yandex Toloka are explored, and deep learning-based classification models with LSTM and BiLSTM are built, both achieving a 0.44 F1-score.

Multilingual Open Text Release 1: Public Domain News in 44 Languages

We present Multilingual Open Text (MOT), a new multilingual corpus containing text in 44 languages, many of which have limited existing text resources for natural language processing. The first…

The 5Js in Ethiopia: Amharic Hate Speech Data Annotation Using Toloka Crowdsourcing Platform

The main challenges of crowdsourcing annotation for Amharic hate speech data collection using Toloka are explored; a Fleiss’ kappa score of 0.34 is attained with three independent annotators labeling each tweet, and the gold label is determined by majority voting.

Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities

This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages (Amharic, Afaan Oromo, Tigrinya, and Wolaytta) and provides a centralized GitHub repository of publicly available resources for various NLP tasks in these languages.

Give your Text Representation Models some Love: the Case for Basque

A number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER.

Embed More Ignore Less (EMIL): Exploiting Enriched Representations for Arabic NLP

It is shown that embedding the information encoded in automatically acquired Arabic diacritics improves performance across all datasets on both NER and POS tagging.

FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP

The core idea of the FLAIR framework is to present a simple, unified interface for conceptually very different types of word and document embeddings, which effectively hides all embedding-specific engineering complexity and allows researchers to “mix and match” various embeddings with little effort.
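The “mix and match” idea is essentially embedding stacking: per token, vectors from different embedding sources are concatenated into one longer vector. A toy plain-Python sketch of that operation (the two lookup tables are made-up stand-ins for, say, classic word embeddings and contextual character-LM embeddings; this is not FLAIR's API):

```python
# Hypothetical embedding tables standing in for two embedding types.
word_emb = {"semantic": [0.1, 0.2], "models": [0.3, 0.4]}
char_emb = {"semantic": [0.9], "models": [0.8]}

def stack(token):
    """Concatenate the token's vectors from every embedding source,
    mirroring what a stacked-embeddings wrapper does per token."""
    return word_emb[token] + char_emb[token]  # list concatenation

print(stack("semantic"))  # [0.1, 0.2, 0.9]
```

A downstream tagger then consumes the concatenated vector and never needs to know how many, or which, embedding types produced it.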

Contextual String Embeddings for Sequence Labeling

This paper proposes to leverage the internal states of a trained character language model to produce a novel type of word embedding, referred to as contextual string embeddings, which fundamentally model words as sequences of characters and are contextualized by their surrounding text.

High Quality ELMo Embeddings for Seven Less-Resourced Languages

It is demonstrated that embedding quality strongly depends on the size of the training set, and that existing publicly available ELMo embeddings for the listed languages can be improved.

Can Network Embedding of Distributional Thesaurus Be Combined with Word Vectors for Better Representation?

This is the first attempt to show that combining the proposed word representation, obtained by distributional thesaurus embedding, with state-of-the-art word representations improves performance by a significant margin on NLP tasks such as word similarity and relatedness, synonym detection, and analogy detection.

Word Similarity Datasets for Thai: Construction and Evaluation

Three Thai word similarity datasets are created by translating and re-rating the popular WordSim-353, SimLex-999, and SemEval-2017-Task-2 datasets; baseline evaluations with existing Thai embedding models are included to give a broader picture of the properties of the evaluated word embedding models.
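Word similarity benchmarks like these are typically scored by computing the model's cosine similarity for each word pair and then taking the Spearman rank correlation against the human ratings. A self-contained sketch with toy vectors and ratings (no real Thai data; the tie-free rank computation is a simplification):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation, assuming no ties in either list."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy embedding and human similarity ratings for three word pairs.
emb = {"cat": [1.0, 0.1], "dog": [0.9, 0.2],
       "car": [0.1, 1.0], "truck": [0.2, 0.95]}
pairs = [("cat", "dog"), ("car", "truck"), ("cat", "car")]
human = [8.5, 9.0, 1.0]
model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(spearman(model, human))  # 1.0: model ranks the pairs like humans
```

Only the ranking matters, which is why cosine scores on a 0-to-1 scale can be compared against human ratings on an arbitrary scale.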

ConceptNet 5.5: An Open Multilingual Graph of General Knowledge

A new version of the linked open data resource ConceptNet is presented that is particularly well suited to use with modern NLP techniques such as word embeddings, with state-of-the-art results on intrinsic evaluations of word relatedness that translate into improvements on applications of word vectors, including solving SAT-style analogies.
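Analogy solving with word vectors usually means the vector-offset method: for "a is to b as c is to ?", find the word closest to b - a + c. A toy sketch with hand-constructed 2-D vectors chosen so the arithmetic works out (real embeddings are high-dimensional and learned, not designed like this):

```python
import math

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (
        math.hypot(*u) * math.hypot(*v))

# Hand-built toy vectors; the king-man offset equals the queen-woman offset.
emb = {
    "man":   [1.0, 0.0],
    "king":  [2.0, 0.0],
    "woman": [1.0, 1.0],
    "queen": [2.0, 1.0],
    "apple": [0.0, 3.0],
}

def solve_analogy(a, b, c):
    """a : b :: c : ?  via the vector-offset method (b - a + c),
    excluding the three query words from the candidates."""
    target = [x - y + z for x, y, z in zip(emb[b], emb[a], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(solve_analogy("man", "king", "woman"))  # queen
```

Excluding the query words matters in practice: with real embeddings, b or c is often the nearest neighbor of the offset vector itself.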

Embeddings in Natural Language Processing

This tutorial provides a high-level synthesis of the main embedding techniques in NLP in the broad sense, starting with conventional word embeddings and then moving to other types of embeddings, such as sense-specific and graph-based alternatives.

Unsupervised Cross-lingual Representation Learning at Scale

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.