Corpus ID: 12951274

1.5 billion words Arabic Corpus

@article{ElKhair201615BW,
  title={1.5 billion words Arabic Corpus},
  author={Ibrahim Abu El-Khair},
  journal={ArXiv},
  year={2016},
  volume={abs/1611.04033}
}
This study is an attempt to build a contemporary linguistic corpus for the Arabic language. [...] The corpus was encoded with two types of encoding, namely UTF-8 and Windows CP-1256. It was also marked up with two mark-up languages, namely SGML and XML.
An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus
TLDR
Different design criteria are proposed and used with the KSUSC to conclude that the resulting corpus can be of great benefit to researchers who are interested in integrating the corpus with their own work or using its resulting lexicons with Saudi-based NLP tasks.
NADA: New Arabic Dataset for Text Classification
TLDR
A New Arabic Dataset (NADA) for Text Categorization purpose is proposed, composed of two existing corpora OSAC and DAA and organized based on Dewey decimal classification scheme and Synthetic Minority Over-Sampling Technique.
TOWARDS A ROBUST UNDERSTANDING OF ARABIC WORD SENSE
Word Sense Disambiguation (WSD) is a task which aims to identify the meaning of a word given its context. This problem has been investigated and analyzed in depth in English. However, work in Arabic [...]
Evaluating Various Tokenizers for Arabic Text Classification
TLDR
This paper introduces three new tokenization algorithms for Arabic and compares them to three other baselines using unsupervised evaluations, and shows that the performance of such tokenization algorithms depends on the size of the dataset, the type of the task, and the amount of morphology that exists in the dataset.
Improving Sentiment Analysis in Arabic Using Word Representation
TLDR
This paper describes how to construct Word2Vec models from a large Arabic corpus obtained from ten newspapers in different Arab countries, and reports improved accuracy of sentiment classification on the publicly available Arabic language health sentiment dataset.
Arabic Text Classification of News Articles Using Classical Supervised Classifiers
TLDR
An ensemble model that combines the best classifiers in a majority-voting classifier is implemented to automatically identify the category of a document based on its linguistic features.
AraCust: a Saudi Telecom Tweets corpus for sentiment analysis
TLDR
This paper presents how AraCust, the first Telecom GSC for Arabic Sentiment Analysis (ASA) for Dialectal Arabic (DA), was constructed, cleaned, pre-processed, and annotated, and demonstrates its power by performing an exploratory data analysis of the features that were sourced from the nature of the corpus.
Arabic sentiment analysis using recurrent neural networks: a review
TLDR
A systematic examination of the literature is presented to label, evaluate, and identify state-of-the-art studies using RNNs for Arabic sentiment analysis.
Sensitivity of Arabic Sentiment Analysis Tools
  • Brian Conlon, P. Brenner
  • Computer Science
  • 2020 Seventh International Conference on Social Networks Analysis, Management and Security (SNAMS)
  • 2020
TLDR
This work reviews some of the unique challenges inherent in the Arabic language that contribute to this accuracy lag beyond the scale-based economic drivers propelling enhanced accuracy in other languages, and attempts to provide a normalized basis for sentiment polarity.
A Multi-Embeddings Approach Coupled with Deep Learning for Arabic Named Entity Recognition
TLDR
This work investigates the performance of pooled contextual embeddings and the bidirectional encoder representations from Transformers (BERT) model when used for NER on the Arabic language, while addressing Arabic-specific issues.

References

SHOWING 1-10 OF 32 REFERENCES
Building A Modern Standard Arabic Corpus
TLDR
The results of experiments in building a corpus for Modern Standard Arabic using data available on the World Wide Web are presented, and the completeness and representativeness of this corpus are demonstrated to show its suitability for Language Engineering experiments.
The Absence of Arabic Corpus Linguistics: A Call for Creating an Arabic National Corpus
TLDR
This concise research calls for creating an Arabic National Corpus (ANC) based on a four-step design: planning the corpus, collecting the data, computerizing the corpus, and analyzing the corpus.
A 700M+ Arabic corpus: KACST Arabic corpus design and construction
TLDR
The King Abdulaziz City for Science and Technology (KACST) Arabic corpus is introduced, which was designed and created to overcome the limitations of existing Arabic corpora.
The International Corpus of Arabic: Compilation, Analysis and Evaluation
This paper focuses on a project for building the first International Corpus of Arabic (ICA). It is planned to contain 100 million analyzed tokens with an interface which allows users to interact with [...]
Developing Tools for Arabic Corpus for Researchers
This paper presents ongoing research that aims to construct a sizable and reliable text corpus along with a set of tools to experiment with natural language applications for Arabic. The corpus is [...]
The design of a corpus of Contemporary Arabic
TLDR
The survey of the needs of teachers of Arabic as a foreign language (TAFL) and language engineers shows that a wide range of text types should be included in the corpus, confirming the view that existing corpora are too narrowly limited in source-type and genre.
Critical Survey of the Freely Available Arabic Corpora
TLDR
The results of a recent survey conducted to identify the freely available Arabic corpora and language resources are presented, grouped into the various categories studied.
KALIMAT a multipurpose Arabic corpus
TLDR
The idea of generating an Arabic multipurpose corpus called KALIMAT (Arabic transliteration of “WORDS”), which could benefit researchers working on different Arabic NLP areas, is presented.
Evaluation of Topic Identification Methods on Arabic Corpora
TLDR
Comparison results of six text categorization methods on a new Arabic corpus, Alwatan-2004, indicate that TR-Classifier is the most efficient among the set of classifiers; only binary SVM outperformed it, thanks to its characteristics.
Comparison of Topic Identification methods for Arabic Language
TLDR
The results are encouraging for both the SVM and TFIDF classifiers; however, the superiority of the SVM classifier and its high capability to distinguish topics are noted.