Automatic Generation of Stopwords in the Amharic Text

Authors: Sileshi Girmaw Miretie and Vijay M. Khedkar
Journal: International Journal of Computer Applications
Pre-processing is the main task when retrieving information from documents in different natural languages. During pre-processing, words that occur too frequently and carry little semantic content in the document should be identified. Such words are called stopwords. Stopword lists have been identified for many world languages, such as English, Chinese, Hindi, Arabic, and Sanskrit. But to the best of the authors' knowledge, there is no standard method to identify these words for the Amharic language. In…
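
The abstract's working definition (words that occur too frequently and carry little meaning) can be illustrated with a simple frequency ranking. The sketch below is only an illustration of that general idea, not the authors' method, which the truncated abstract does not describe. The function name `frequency_stopword_candidates`, the `top_k` parameter, whitespace tokenization, and the English toy sentences are all assumptions made for the example.

```python
from collections import Counter


def frequency_stopword_candidates(documents, top_k=2):
    """Return the top_k most frequent tokens as stopword candidates.

    Hypothetical helper: tokenization is plain whitespace splitting,
    which is an assumption and would need replacing for real Amharic text.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]


# Toy English corpus, used only to keep the example readable.
docs = [
    "the cat sat on the mat",
    "the dog slept on the sofa",
    "a bird on the roof",
]
candidates = frequency_stopword_candidates(docs, top_k=2)
print(candidates)
```

A raw frequency cut-off like this is only a first pass: it usually needs manual review or a second statistic, since frequent content words can rank alongside true stopwords.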

Stopword Identification and Removal Techniques on TC and IR applications: A Survey

  • Dhara J. Ladani, N. Desai
  • Computer Science
    2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)
  • 2020
This paper discusses the major stopword identification techniques used by researchers in the last few decades for Indian and non-Indian languages, and presents a survey of methods used for stopword list generation along with their characteristics.

Effect of stopwords in Indian language IR

It is observed that stopword removal generally improves mean average precision (MAP) significantly compared with the case when it is not done, and, based on experiment, a number of stopwords can be recommended for chosen Indian languages that is good enough from a retrieval point of view.

Accelerating Text Mining Using Domain-Specific Stop Word Lists

A novel mathematical approach for the automatic extraction of domain-specific words, called the hyperplane-based approach, which depends on a low-dimensional representation of the word in vector space and its distance from a hyperplane, and which significantly reduces text dimensionality by eliminating irrelevant features.

Bengali Stop Word and Phrase Detection Mechanism

This research introduces a new definition and classification of Bengali stop words and phrases and implements two approaches to identify them: the first is corpus-based, while the second is based on a finite-state automaton.

Indonesian Journal of Electrical Engineering and Computer Science

This work presents the development of the phasmophobia detection electroencephalogram database (PDED), which consists of an average of 45 minutes of electroencephalography (EEG) recordings from eight electrodes situated on the frontal lobe of the brain, and indicates that two electrodes were sufficient to recognize 88% of fear from the recordings.



Generating Stopword List for Sanskrit Language

The paper presents a first-of-its-kind list of seventy-five generic stopwords for the Sanskrit language, extracted from data amounting to nearly seventy-six thousand words, following a hybrid approach.

Stop-Word Removal Algorithm and its Implementation for Sanskrit Language

A simple dictionary-based approach is used to design and implement a stop-word removal algorithm for the Sanskrit language.
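
Dictionary-based removal, in its simplest form, means checking each token against a fixed stopword list. The following is a minimal sketch of that general technique, not the paper's implementation; the function name `remove_stopwords`, the whitespace tokenization, and the English toy list `TOY_STOPWORDS` are assumptions chosen for readability rather than Sanskrit data.

```python
def remove_stopwords(text, stopwords):
    """Drop every token that appears in the stopword dictionary.

    Hypothetical helper: tokens are split on whitespace and compared
    case-insensitively, both of which are assumptions for this sketch.
    """
    return [token for token in text.split() if token.lower() not in stopwords]


# Toy English list standing in for a real language-specific dictionary.
TOY_STOPWORDS = {"the", "on", "a"}

filtered = remove_stopwords("The cat sat on the mat", TOY_STOPWORDS)
print(filtered)
```

Storing the list in a set keeps each lookup at constant average cost, so the whole pass is linear in the number of tokens.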

HSRA: Hindi stopword removal algorithm

This paper proposes a stopword removal algorithm for the Hindi language based on a Deterministic Finite Automaton (DFA); it was tested on 200 documents, achieved 99% accuracy, and is also time efficient.

A Rule-Based Approach to Identify Stop Words for Gujarati Language

For the first time in the scientific community worldwide, a dynamic approach is presented that is independent of all factors, namely usage of a file or dictionary, word length, word frequency, and training dataset, and that focuses on automatic and dynamic identification of a complete list of Gujarati stop words.

Automatic Identification of Chinese Stop Words

Results show that the generated stop word list can significantly improve the accuracy of Chinese segmentation, save time, and relieve the burden of manual stop word selection.

PSWG: An automatic stop-word list generator for Persian information retrieval systems based on similarity function & POS information

An aggregated method for automatically building stop-word lists for Persian information retrieval systems using part of speech tagging and analyzing statistical features of terms is proposed to enhance the accuracy of retrieval and minimize potential side effects of removing informative terms.

Entropy-Based Generic Stopwords List for Yoruba Texts

This research employed an entropy-based algorithm to identify stopword candidates for Yoruba language texts. Two sets of corpus of 756,039 Yoruba words were used; the diacritized and its undiacritized

The automatic identification of stop words

It is shown how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure, and it becomes possible to identify the stop words in a collection by automated statistical testing.

A supervised approach to distinguish between keywords and stopwords using probability distribution functions

  • Aditi Sharan, Sifatullah Siddiqi
  • Computer Science, Economics
    2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI)
  • 2014
A novel probability-based approach for distinguishing between keywords and stopwords in a text corpus that is corpus-based, supervised, computationally very efficient, and independent of the language used.

Automatically generation and evaluation of Stop words list for Chinese Patents

  • Deng Na, Chen Xu
  • Computer Science
  • 2015
The experimental results indicate that both methodologies can extract stop words suitable for Chinese patents, and that the accuracy of the methodology based on statistics is a little higher than that of the one based on word frequency.