• Corpus ID: 13333160

Build Fast and Accurate Lemmatization for Arabic

Hamdy Mubarak
In this paper we describe the complexity of building a lemmatizer for Arabic, which has a rich and complex derivational morphology, and we discuss the need for fast and accurate lemmatization to enhance Arabic Information Retrieval (IR) results. We also introduce a new data set that can be used to test lemmatization accuracy, and an efficient lemmatization algorithm that outperforms state-of-the-art Arabic lemmatization in terms of accuracy and speed. We share the data set and the code for…

Meta-search based approach for Arabic information retrieval

The authors combine four different morphological levels for the first time in Arabic IR; the approach substantially outperforms previous research results and may help future research choose the most suitable tools and develop more sophisticated methods for handling the complexity of the Arabic language.

Roadmap for an Arabic Controlled Language

A roadmap for developing an Arabic CNL to provide new and advanced natural language services for Arabic speakers is proposed, with two major approaches: one relies on leveraging already-built CNLs, whereas the other starts from scratch.

Otrouha: A Corpus of Arabic ETDs and a Framework for Automatic Subject Classification

Otrouha, a framework for automatic subject classification of Arabic ETDs, is presented; it applies different classification models that use classical machine learning as well as deep learning techniques, and results show that, among the machine learning models, binary classification (one-vs-all) performed better than multiclass classification.

Arabic real time entity resolution using inverted indexing

This paper proposes a framework, Arabic Real Time Entity Resolution (ARTER), that uses DySimII with Arabic databases to perform real time ER, and examines the different string similarity functions required for comparing records in the matching process to evaluate which similarity function is more suitable for comparing Arabic strings.

Developing Set of Word Senses of Vocabulary in Al-Qur’an

This research aims to construct word senses as a set of vocabulary in order to simplify vocabulary meaning in the Al-Qur’an itself; accuracy is low because the type and amount of data used are limited.

MSTD: Moroccan Sentiment Twitter Dataset

This work presents the effect of stemming and lemmatization on the improvement of the obtained accuracies of the MSTD (Moroccan Sentiment Twitter Dataset), the largest Moroccan dataset for sentiment analysis.

Roadblocks in Gender Bias Measurement for Diachronic Corpora

This paper documents problems that arise with this method of quantifying gender bias in diachronic Arabic and Chinese corpora, and documents clear changes in profession words used over time and, somewhat surprisingly, even changes in the simpler gendered defining set word pairs.

A panoramic survey of natural language processing in the Arab world

Though Arabic NLP has many challenges, it has seen many successes and developments, and the last decade in particular has witnessed an incredible increase in quality, matched with a rise in public awareness, use, and expectations.

Arabic Fake News Detection Based on Textual Analysis

A supervised machine learning model is introduced that classifies Arabic news articles based on their context’s credibility and the first dataset of Arabic fake news articles composed through crowdsourcing is introduced.

A new classification model with a multi-layer dimensionality reduction approach

  • Anoual El Kah, I. Zeroual
  • Computer Science
    2022 11th International Symposium on Signal, Image, Video and Communications (ISIVC)
  • 2022
The benefit of using a hybrid multi-layer dimension reduction approach, especially for Arabic topic classification, is highlighted, and a bagging classifier model is implemented using the Support Vector Machines (SVM) as its base model.



An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes

The proposed lemmatizer makes use of different Arabic language knowledge resources to generate an accurate lemma form and its relevant features, making it suitable for information retrieval (IR) systems.

Farasa: A Fast and Furious Segmenter for Arabic

Farasa outperforms or is at par with the state-of-the-art Arabic segmenters (Stanford and MADAMIRA), while being more than one order of magnitude faster.

Farasa: A New Fast and Accurate Arabic Word Segmenter

Farasa (meaning "insight" in Arabic) is a fast and accurate Arabic segmenter that outperforms or matches state-of-the-art Arabic segmenters, namely QATARA and MADAMIRA.

MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic

MADAMIRA is a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude.

Arabic Diacritization: Stats, Rules, and Hacks

A new and fast state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings, and uses Support Vector Machine (SVM) based ranking coupled with morphological patterns and linguistic rules to properly guess case endings.

Arabic Finite-State Morphological Analysis and Generation

A large-scale system that performs morphological analysis and generation of on-line Arabic words represented in the standard orthography, whether fully voweled, partially voweled or unvoweled, using Xerox Finite-State Morphology tools.

A Rule based Approach to Word Lemmatization

When learning from a corpus of lemmatized Slovene words, the RDR approach results in easy-to-understand rules with improved classification accuracy compared to the results of rule learning achieved in previous work.

Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques

The presented system has a significantly better performance than the existing Arabic extractor systems, where precision and recall values reach double their corresponding values in the other systems especially for lengthy and non-scientific articles.

Roots & patterns vs. stems plus grammar-lexis specifications: on what basis should a multilingual database centred on Arabic be built?

It is shown here how and why a stem-grounded lexical database, the items of which are associated with grammar-lexis specifications – as opposed to a root-&-pattern database –, is motivated both linguistically and with regards to efficiency, economy and modularity.

Standard Arabic Morphological Analyzer (SAMA) version

  • 2009