Corpus ID: 2565481

Toward an ARABIC Stop-Words List Generation

@article{Alajmi2012TowardAA,
  title={Toward an ARABIC Stop-Words List Generation},
  author={Amal Alajmi and Elsayed M. Saad and R. R. Darwish},
  journal={International Journal of Computer Applications},
  year={2012},
  volume={46},
  pages={8--13}
}
Over the past decades, systems for the automatic management of electronic documents have been one of the main fields of research. Text processing is a wide area that includes many important disciplines. In the process of organizing unstructured text so that a mining technique can be applied, preprocessing is required. One of the most important preprocessing techniques is the removal of function words, which affects the performance of text mining tasks. In this paper, a statistical approach…
On Arabic Stop-Words: A Comprehensive List and a Dedicated Morphological Analyzer
TLDR
A new comprehensive Arabic stop-words list is compiled along with a stop-words analyzer that combines that list with a machine-learning-based approach to identify the most probable stop-word.
Automatic Generation of Stopwords in the Amharic Text
TLDR
This paper proposes the automatic identification of stopwords in Amharic text using an aggregate-based methodology that combines word frequency, inverse document frequency, and an entropy measure.
HSRA: Hindi stopword removal algorithm
TLDR
This paper proposes a stopword removal algorithm for the Hindi language based on a Deterministic Finite Automaton (DFA); it was tested on 200 documents, achieved 99% accuracy, and is also time-efficient.
Generating Stopword List for Sanskrit Language
TLDR
The paper presents a first-of-its-kind list of seventy-five generic stopwords for the Sanskrit language, extracted from data amounting to nearly seventy-six thousand words using a hybrid approach.
PSWG: An automatic stop-word list generator for Persian information retrieval systems based on similarity function & POS information
TLDR
An aggregated method for automatically building stop-word lists for Persian information retrieval systems, using part-of-speech tagging and analysis of the statistical features of terms, is proposed to enhance retrieval accuracy and minimize the potential side effects of removing informative terms.
Automatic identification of light stop words for Persian information retrieval systems
TLDR
This paper proposes an automatic aggregated methodology based on term frequency, normalized inverse document frequency and information model to extract the light stop words from Persian text to reduce the number of index terms.
Stemming versus multi-words indexing for Arabic documents classification
TLDR
Empirical results on an Arabic dataset reveal that the choice of extracted feature type has a significant impact on conserving semantic information and improving classification accuracy, especially given the morphological complexity of the Arabic language.
Accelerating Text Mining Using Domain-Specific Stop Word Lists
TLDR
A novel mathematical approach for the automatic extraction of domain-specific words, called the hyperplane-based approach, is proposed; it relies on a low-dimensional representation of each word in vector space and its distance from a hyperplane to significantly reduce text dimensionality by eliminating irrelevant features.
An Automatic Construction of Malay Stop Words Based on Aggregation Method
TLDR
This study proposes an aggregation technique using three different approaches for the automatic construction of a general Malay stop-word list by considering words’ frequencies (highest and lowest) against their ranks, a method inspired by Zipf’s law.
Context aware stopwords for Sinhala Text classification
TLDR
Seven stopword identification methods previously applied to other languages are presented for removing stopwords, and a new algorithm for building a domain-specific stopword list is proposed.

References

SHOWING 1-10 OF 31 REFERENCES
Research on the Construction and Filter Method of Stop-word List in Text Preprocessing
  • Zhou Yao, Cao Ze-wen
  • Computer Science
  • 2011 Fourth International Conference on Intelligent Computation Technology and Automation
  • 2011
TLDR
This paper summarizes the definition, extraction principles, and extraction methods of stop-words, and constructs a customized Chinese-English stop-word list from the classical stop-word list, based on differences in the text documents' domains.
Automatic Identification of Stop Words in Chinese Text Classification
  • Lili Hao, Lizhu Hao
  • Computer Science
  • 2008 International Conference on Computer Science and Software Engineering
  • 2008
TLDR
This paper gives a refined definition of stop words in Chinese text classification from the perspective of statistical correlation, then proposes an automatic approach to extracting the stop word list based on the weighted Chi-squared statistic on a 2*p contingency table, and evaluates the stop word lists using accuracies obtained from text classification experiments on a real-world Chinese corpus.
Arabic verb pattern extraction
  • El-Said Saad, M. Awadalla, A. Alajmi
  • Computer Science
  • 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010)
  • 2010
TLDR
A new method is presented for extracting Arabic text stems and lemmas based on pattern extraction, using a special encoding that divides letters into original and non-original letters.
Automatic construction of Chinese stop word list
TLDR
This paper proposes an automatic aggregated methodology based on statistical and information models for extracting a stop word list in the Chinese language, and shows that the list is also much more general than other Chinese stop lists.
Effective Stemming for Arabic Information Retrieval
Arabic has a very rich and complex morphology. Its appropriate morphological processing is very important for Information Retrieval (IR). In this paper, we propose a new stemming technique that tries…
Towards modernised and Web-specific stoplists for Web document analysis
  • M. P. Sinka, D. Corne
  • Computer Science
  • Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003)
  • 2003
TLDR
Two new word-entropy-based stoplists are developed: one derived from random Web pages, and one derived from the BankSearch dataset. Existing stoplists are found to perform well, but are sometimes outperformed by the new stoplists, especially on hard classification tasks.
A Stemming Procedure and Stopword List for General French Corpora
  • J. Savoy
  • Computer Science
  • J. Am. Soc. Inf. Sci.
  • 1999
TLDR
The tools needed to establish a general stopword list and a stemming procedure for French-language databases are presented, and retrieval experiments carried out on two medium-sized French-language test collections are evaluated.
The selection of Mongolian stop words
  • Gong Zheng, Guan Gaowa
  • Computer Science
  • 2010 IEEE International Conference on Intelligent Computing and Intelligent Systems
  • 2010
TLDR
The main idea of this method is to obtain Mongolian stop words algorithmically, and then remove from the list the nouns and verbs that serve as main components of sentences, as well as Mongolian homonyms.
Hybrid Stop-Word Removal Technique for Arabic Language
Encyclopedia of Information Science and Technology
TLDR
This five-volume encyclopedia includes more than 550 articles highlighting current concepts, issues and emerging technologies that can be accessed by scholars, students, and researchers in the field of information science and technology.