Evaluating Various Tokenizers for Arabic Text Classification

Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Irfan Ahmad
The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in terms of vocabulary size. In the literature, many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limits the vocabulary size in a given text corpus. Most tokenization techniques are language-agnostic, i.e., they don't…
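The subword idea described above can be illustrated with a minimal sketch of byte-pair encoding (BPE), one of the common subword algorithms. The toy corpus and helper below are hypothetical; real tokenizers apply the same pair-merging idea over much larger vocabularies.

```python
# Minimal sketch of BPE subword learning (toy corpus, hypothetical helper).
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn merge rules by repeatedly merging the most frequent symbol pair."""
    # Each word starts as a tuple of characters; merges build larger subwords.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = ["low", "low", "lower", "lowest", "newer", "newer"]
rules = learn_bpe(corpus, num_merges=4)
print(rules)  # frequent character pairs such as ('l', 'o') merge first
```

Because merges are driven purely by pair frequencies, nothing in the procedure is specific to any language, which is exactly the language-agnostic property the abstract refers to.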


Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

This survey connects several lines of work from the pre-neural and neural eras, showing how hybrid word-and-character approaches, as well as subword-based approaches built on learned segmentation, have been proposed and evaluated.

Arabic Document Classification: Performance Investigation of Preprocessing and Representation Techniques

This work evaluates the effectiveness of machine learning (ML) algorithms under different preprocessing and representation techniques, and shows that classification performance strongly depends on the preprocessing technique, the representation method, the classification technique, and the nature of the datasets used.

Cloud computing architecture for Tagging Arabic Text Using Hybrid Model

This paper presents and deploys a cloud computing architecture for tagging Arabic text using a hybrid model, which helps reduce effort and cost while achieving excellent tagging accuracy and fast response times.

AI-Based Misogyny Detection from Arabic Levantine Twitter Tweets

An Arabic text recognition approach is presented for detecting misogyny in tweets using the Arabic Levantine Twitter dataset for misogynistic language, and it appears useful for providing practical solutions for detecting misogyny on Arabic social media.

On the Importance of Tokenization in Arabic Embedding Models

This work proposes two embedding strategies that modify the tokenization phase of traditional word embedding models (Word2Vec and BERT) to take into account Arabic’s relatively complex morphology.

hULMonA: The Universal Language Model in Arabic

This work develops the first Universal Language Model in Arabic (hULMonA), demonstrates its use for Arabic classification tasks, and shows how a pre-trained multilingual BERT can also be used for Arabic.

AraBERT: Transformer-based Model for Arabic Language Understanding

This paper pre-trains BERT specifically for Arabic, pursuing the same success BERT achieved for English, and shows that the newly developed AraBERT achieves state-of-the-art performance on most tested Arabic NLP tasks.

The Impact of Preprocessing on Arabic-English Statistical and Neural Machine Translation

This paper systematically compares neural and statistical MT models for Arabic-English translation on data preprocessed with various prominent tokenization schemes, and shows that the best choice of tokenization scheme depends largely on the type of model and the size of the data.

A comparative study of effective approaches for Arabic sentiment analysis

Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

This work explores three language-independent alternatives to morphological segmentation: i) data-driven subword units, ii) characters as the unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network).

Enriching Word Vectors with Subword Information

A new approach based on the skip-gram model, in which each word is represented as a bag of character n-grams and word vectors are the sum of these representations; it achieves state-of-the-art performance on word similarity and analogy tasks.
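The character n-gram decomposition this summary describes can be sketched in a few lines. The helper below is hypothetical; it mirrors the fastText convention of wrapping each word in boundary markers `<` and `>` before extracting n-grams, with the default range of 3 to 6 characters.

```python
# Sketch of fastText-style character n-gram extraction (hypothetical helper).
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, including boundary markers."""
    token = f"<{word}>"  # boundary markers distinguish prefixes/suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams

# The word vector would then be the sum of the vectors of these n-grams
# (plus a vector for the whole word itself).
print(char_ngrams("where", n_min=3, n_max=4))
```

For morphologically rich languages such as Arabic, this subword sharing lets related surface forms of the same root reuse n-gram vectors, which is why the approach is relevant to the tokenization question studied in the main paper.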

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, achieves accuracy comparable to existing pipelines while training subwords directly from raw sentences.

CamemBERT: a Tasty French Language Model

This paper investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating their language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.