A Rule-Based Approach to Identify Stop Words for Gujarati Language

  • Rajnish M. Rakholia, Jatinderkumar R. Saini
Stop word removal is an important step in many natural language processing (NLP) tasks. To date, no standardized, exhaustive, and dynamic stop word list has been created for documents written in Gujarati, an Indian language spoken by nearly 66 million people worldwide. Most existing stop word removal approaches are file or dictionary based, relying on a hard-coded, static, nonstandardized, and individually created list of stop words. The existing approaches are time consuming…
Stop-Word Removal Algorithm and its Implementation for Sanskrit Language
A simple approach is used to design and implement a stop-word removal algorithm for the Sanskrit language; both the algorithm and its implementation are dictionary based.
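A dictionary-based approach of the kind described here can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `remove_stop_words` and the toy stop-word set (English stand-ins rather than real Sanskrit entries) are assumptions made for the example.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word dictionary, in order."""
    stop_set = set(stop_words)  # set membership tests are O(1) on average
    return [t for t in tokens if t not in stop_set]


# Hypothetical toy dictionary; a real list would hold Sanskrit stop words.
STOP_WORDS = {"the", "is", "of"}

tokens = "the meaning of the verse is clear".split()
filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
```

The quality of the output depends entirely on the dictionary, which is why statically curated lists are a recurring concern in this literature.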
Automatic Generation of Stopwords in the Amharic Text
This paper proposes automatic identification of stopwords in Amharic text using an aggregate-based methodology that combines word frequency, inverse document frequency, and an entropy measure.
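The three signals named in this summary can be computed from a tokenized corpus as follows. This is a rough sketch under stated assumptions: the function `stopword_scores`, the toy corpus, and the final ranking rule (lowest IDF first, ties broken by highest entropy) are illustrative choices, and the paper's actual aggregation method may differ.

```python
import math
from collections import Counter


def stopword_scores(docs):
    """Map each word to (relative frequency, IDF, entropy across documents).

    docs is a list of token lists, one per document.
    """
    n_docs = len(docs)
    per_doc = [Counter(d) for d in docs]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    n_tokens = sum(total.values())

    scores = {}
    for word, count in total.items():
        df = sum(1 for counts in per_doc if word in counts)
        tf = count / n_tokens
        idf = math.log(n_docs / df)
        # Entropy of the word's occurrences spread over documents:
        # evenly spread words (typical of stop words) score high.
        entropy = -sum(
            (counts[word] / count) * math.log(counts[word] / count)
            for counts in per_doc
            if counts[word] > 0
        )
        scores[word] = (tf, idf, entropy)
    return scores


# Toy corpus with placeholder tokens (assumption, not Amharic data).
docs = [["a", "b", "a"], ["a", "c"], ["a", "b"]]
scores = stopword_scores(docs)

# Assumed aggregation: low IDF first, then high entropy.
candidates = sorted(scores, key=lambda w: (scores[w][1], -scores[w][2]))
print(candidates)
```

A word that appears in every document with an even spread gets an IDF of zero and the highest entropy, so it ranks first as a stop-word candidate.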
Stopword Identification and Removal Techniques on TC and IR applications: A Survey
  • Dhara J. Ladani, N. Desai
  • Computer Science
    2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)
  • 2020
This paper discusses the major stopword identification techniques used by researchers over the last few decades for Indian and non-Indian languages, and presents a survey of methods for stopword list generation along with their characteristics.
Generating Javanese Stopwords List using K-means Clustering Algorithm
This work has shown that a stopword list for a low-resource language such as Javanese is not yet available, and suggests that such a list should be developed in the near future.
DBTechVoc: A POS-tagged Vocabulary of Tokens and Lemmata of the Database Technical Domain
The empirical results, based on more than 1000 high-quality research papers collected over 45 years (1976 to 2021), prove that the general technical word list for the domain of computer science differs from the specific technical word list for the domain of databases.
Dynamic Phrase Generation for Detection of Idioms of Gujarati Language using Diacritics and Suffix-based Rules
Gujarati is the language used for everyday communication in the state of Gujarat, India. The Gujarati language is also officially recognized by the constitution and the government of India. Gujarati
Accelerating Text Mining Using Domain-Specific Stop Word Lists
A novel mathematical approach for the automatic extraction of domain-specific words called the hyperplane-based approach, which depends on the notion of low dimensional representation of the word in vector space and its distance from hyperplane to significantly reduce text dimensionality by eliminating irrelevant features.
Measuring the Similarity between the Sanskrit Documents using the Context of the Corpus
The proposed approach processes Sanskrit, one of the oldest, largely untouched, and morphologically critical languages, and builds a document-term matrix for Sanskrit (DTMS) and a document-synset matrix for Sanskrit (DSMS) to address the problem of polysemy.
Marathi Document: Similarity Measurement using Semantics-based Dimension Reduction Technique
The proposed approach designs a Document Term Matrix for a Marathi corpus (DTMM), converting unstructured data into a tabular format, then forms synsets and thereby reduces dimensions to produce a Document Synset Matrix for the Marathi corpus.
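The reduction from a document-term matrix to a document-synset matrix can be illustrated as below. This is only a sketch of the general idea, assuming a ready-made word-to-synset lookup; the function names, the transliterated tokens, and the `synset_of` mapping are hypothetical and do not come from the paper.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    vocab = sorted({t for d in docs for t in d})
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def document_synset_matrix(vocab, rows, synset_of):
    """Merge term columns that share a synset; unmapped terms stay as-is."""
    synsets = sorted({synset_of.get(t, t) for t in vocab})
    index = {s: i for i, s in enumerate(synsets)}
    merged_rows = []
    for row in rows:
        merged = [0] * len(synsets)
        for term, count in zip(vocab, row):
            merged[index[synset_of.get(term, term)]] += count
        merged_rows.append(merged)
    return synsets, merged_rows


# Hypothetical transliterated tokens: "ghar" and "sadan" both mean "house".
docs = [["ghar", "sadan", "nadi"], ["ghar", "nadi", "nadi"]]
synset_of = {"ghar": "HOUSE", "sadan": "HOUSE"}

vocab, dtm = document_term_matrix(docs)
synsets, dsm = document_synset_matrix(vocab, dtm, synset_of)
print(vocab, dtm)
print(synsets, dsm)
```

Here three term columns collapse into two synset columns, which is the dimension reduction the summary refers to.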
Indonesian Journal of Electrical Engineering and Computer Science
This work presents the development of a phasmophobia detection electroencephalogram database (PDED), which consists of an average of 45 minutes of electroencephalography (EEG) recordings from eight electrodes situated over the frontal lobe of the brain, and indicates that two electrodes were sufficient to recognize 88% of fear from the recordings.


The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format
Presents the design and implementation of diacritic extraction for the Gujarati script using the Unicode Transformation Format (UTF); the results are encouraging enough to make the proposed implementation viable for NLP tasks in the Gujarati language.
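One way to approximate diacritic extraction for Gujarati text is to filter characters by Unicode block and general category. The sketch below is an assumption-laden illustration and not the paper's technique: it treats every combining mark (category `Mn` or `Mc`) in the Gujarati block U+0A80 to U+0AFF as a diacritic, and the function name `extract_diacritics` is made up for the example.

```python
import unicodedata


def extract_diacritics(text):
    """Return the combining marks in text that fall in the Gujarati block."""
    return [
        ch
        for ch in text
        if "\u0A80" <= ch <= "\u0AFF"
        and unicodedata.category(ch).startswith("M")  # Mn or Mc
    ]


# The word "Gujarati" in Gujarati script, written with code point escapes.
word = "\u0A97\u0AC1\u0A9C\u0AB0\u0ABE\u0AA4\u0AC0"
marks = extract_diacritics(word)
print([f"U+{ord(m):04X}" for m in marks])
```

Python strings are sequences of code points, so the UTF-8 byte encoding does not need to be handled by hand here; an implementation working on raw bytes would first decode the three-byte sequences used for this block.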
Toward an ARABIC Stop-Words List Generation
A statistical approach is presented to extract an Arabic stop-words list; results show that an ANN-based classifier performs better with the generated stop-words list than with the general list.
Research on the Construction and Filter Method of Stop-word List in Text Preprocessing
  • Zhou Yao, Cao Ze-wen
  • Computer Science
    2011 Fourth International Conference on Intelligent Computation Technology and Automation
  • 2011
This paper summarizes the definition, extraction principles, and extraction methods of stop words, and constructs a customized Chinese-English stop-word list from the classical stop-word list based on differences in the text documents' domains.
To stop or not to stop — Experiments on stopword elimination for information retrieval of Gujarati text documents
Results show that elimination of stopwords improves the mean average precision (MAP) values of Gujarati IR, the metric used to measure the efficiency of information retrieval tasks.
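For readers unfamiliar with the metric, MAP can be computed as sketched below, using the standard definition: average precision per query, then the mean over queries. The document IDs and rankings are made-up illustrations and not data from the paper.

```python
def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical results for two queries.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),              # AP = 1/2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

Comparing this score for the same queries with and without stopword elimination is the kind of experiment the summary describes.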
Evaluating Effect of Stemming and Stop-word Removal on Hindi Text Retrieval
The experimental investigation suggests that stop word removal improves retrieval significantly; however, there is a small drop in retrieval precision with all three stemmers.
Design and Development of Stemmer for Tamil Language: Cluster Analysis
An improved light stemming algorithm that produces stemmed Tamil words in fewer computational steps is proposed; results show that words stemmed after clustering give better results than words stemmed before clustering.
A Study and Comparative Analysis of Different Stemmer and Character Recognition Algorithms for Indian Gujarati Script
A literature review of stemmers, optical character recognition (OCR), and text mining work on Indian scripts, mainly for the Gujarati language, is presented.
The selection of Mongolian stop words
  • Gong Zheng, Guan Gaowa
  • Computer Science
    2010 IEEE International Conference on Intelligent Computing and Intelligent Systems
  • 2010
The main idea of this method is to obtain candidate Mongolian stop words algorithmically, and then to remove from the candidates any nouns or verbs that serve as main components of sentences, as well as Mongolian homonyms.
Pre-processing of Domain Ontology Graph Generation System in Punjabi
This research paper focuses on the pre-processing of Punjabi text documents for an ontology graph generation system, which includes applying input restrictions to the text, removing special symbols and punctuation marks, and removing duplicate terms.