Stop-Word Removal Algorithm and its Implementation for Sanskrit Language

Authors: Jaideepsinh K. Raulji and Jatinderkumar R. Saini
Journal: International Journal of Computer Applications
In the information era, optimizing processes for Information Retrieval, Text Summarization, and Text and Data Analytics systems is of the utmost importance. To achieve accuracy, redundant words with little or no semantic meaning must be filtered out. Such words are known as stopwords. Stopword lists have been developed for languages such as English, Chinese, Arabic, and Hindi. A stopword list is also available for the Sanskrit language. Stop-word removal is an important… 
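The filtering step described in the abstract can be sketched as a set-membership test over whitespace tokens. This is a minimal illustration and not the algorithm from the paper: the six-particle list, the whitespace tokenizer, and the function name `remove_stopwords` are all assumptions made for the example.

```python
# Illustrative stop-word list only (a few common Sanskrit particles);
# this is NOT the stop-word list used in the paper.
STOPWORDS = frozenset({"च", "एव", "अपि", "इति", "तु", "हि"})


def remove_stopwords(text, stopwords=STOPWORDS):
    """Split text on whitespace and drop tokens found in the stop-word set.

    The danda (।) sentence mark is treated as a separator. Real Sanskrit
    text would also need sandhi splitting, which this sketch ignores.
    """
    tokens = text.replace("।", " ").split()
    return [tok for tok in tokens if tok not in stopwords]


filtered = remove_stopwords("रामः वनम् अपि गच्छति ।")
print(filtered)
```

Using a `frozenset` keeps each lookup at constant average cost, so the filter stays linear in the number of tokens.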

Dynamic Stopword Removal for Sinhala Language

The focus is to show that the cut-off point depends on the source data and the machine learning algorithm, which is demonstrated using Newton's iteration method for root finding.

Stopword Identification and Removal Techniques on TC and IR applications: A Survey

  • Dhara J. Ladani, N. Desai
  • Computer Science
    2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)
  • 2020
This paper discusses the major stopword identification techniques used by researchers over the last few decades for Indian and non-Indian languages, and presents a survey of methods used for stopword list generation along with their characteristics.

Automatic Generation of Stopwords in the Amharic Text

This paper proposed the automatic identification of Stopwords for the Amharic text by an aggregate based methodology of words frequency, inverse document frequency, and entropy value measure.
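The three measures named in this summary can be computed from a tokenized corpus as sketched below. The summary does not say how the paper aggregates them, so the combined score used here (`tf - idf + entropy`) and the function names `stopword_scores` and `rank_candidates` are assumptions chosen only to illustrate the idea that stopwords tend to have high frequency, low inverse document frequency, and high entropy across documents.

```python
import math
from collections import Counter


def stopword_scores(docs):
    """Return {word: (tf, idf, entropy)} for a list of tokenized documents."""
    n_docs = len(docs)
    per_doc = [Counter(doc) for doc in docs]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    n_tokens = sum(total.values())

    scores = {}
    for word, count in total.items():
        tf = count / n_tokens                       # corpus-wide relative frequency
        df = sum(1 for counts in per_doc if word in counts)
        idf = math.log(n_docs / df)                 # 0 when the word is in every document
        # Entropy of the word's occurrence distribution across documents.
        probs = [counts[word] / count for counts in per_doc if counts[word] > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        scores[word] = (tf, idf, entropy)
    return scores


def rank_candidates(docs):
    """Rank words from most to least stopword-like (assumed ad hoc score)."""
    scores = stopword_scores(docs)
    return sorted(
        scores,
        key=lambda w: scores[w][0] - scores[w][1] + scores[w][2],
        reverse=True,
    )


docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "flew"]]
print(rank_candidates(docs)[0])
```

In practice the three measures live on different scales, so a rank-based or normalized combination would likely be more robust than the raw sum used in this sketch.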

Generating Javanese Stopwords List using K-means Clustering Algorithm

This work has shown that a stopword list for a low-resource language such as Javanese is not yet available, and suggested that such a list should be developed in the near future.

Review on Natural Language Processing Trends and Techniques Using NLTK

Common natural language processing procedures include tokenization, stop-word removal, stemming, lemmatization, part-of-speech tagging, chunking, and named entity recognition, which enhance the performance of NLP applications, and the Natural Language Toolkit is the best possible solution for learning the ropes of the NLP domain.

Functional words removal techniques: A review

  • S. Gandotra, B. Arora
  • Computer Science
    2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC)
  • 2018
All the stop-word removal techniques used for Indian text are discussed, and an analysis of the results produced by those techniques for various Indian languages is also presented.

Accelerating Text Mining Using Domain-Specific Stop Word Lists

A novel mathematical approach for the automatic extraction of domain-specific words called the hyperplane-based approach, which depends on the notion of low dimensional representation of the word in vector space and its distance from hyperplane to significantly reduce text dimensionality by eliminating irrelevant features.

Comparison of text preprocessing methods

The pros and cons of several common text preprocessing methods are discussed: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions.

An Experimental Study of Text Preprocessing Techniques for Automatic Short Answer Grading in Indonesian

This study aims to conduct an experimental study to measure the effectiveness of preprocessing techniques in Automatic Short Answer Grading (ASAG) using questions and answers in Indonesian, and measures the similarity values of teacher and student answers using the Cosine Similarity method.

Automatic Multilingual Stopwords Identification from Very Small Corpora

This paper proposes a novel approach based on term and document frequency to rank candidate stopwords, that works also on very small corpora (even single documents), and proposes an automatic cutoff strategy to select the best candidates in the ranking, thus addressing one of the most critical problems in the stopword identification practice.



Stop-word removal algorithm for Arabic language

The new Arabic stop-word removal technique has been tested using a set of 242 Arabic abstracts chosen from the Proceedings of the Saudi Arabian National Computer conferences and another data set chosen from the Holy Qur'an, and it gives impressive results, reaching approximately 98%.

A Rule-Based Approach to Identify Stop Words for Gujarati Language

For the first time in the scientific community worldwide, a dynamic approach independent of all factors, namely usage of a file or dictionary, word length, word frequency, and training dataset, is presented, focusing on automatic and dynamic identification of a complete list of Gujarati stop words.

Effects of Stop Words Elimination for Arabic Information Retrieval: A Comparative Study

The overall performance of a general stoplist was better than the other two lists, and stoplists improved retrieval effectiveness especially when used with the BM25 weight.

POS Word Class Based Categorization of Gurmukhi Language Stemmed Stop Words

This paper concentrates on providing a better and deeper understanding of Punjabi stop words in light of Punjabi grammar and part-of-speech-based word class categorization.

A Natural Language Processing Approach for Identification of Stop Words in Punjabi Language

This paper concentrates on identification of stop words from poetry and other news articles and discusses the importance of each sub-phase in Punjabi poetry.

Punjabi Stop Words: A Gurmukhi, Shahmukhi and Roman Scripted Chronicle

For the first time in the scientific community dealing with computational linguistics and literature processing using NLP techniques, a list of 184 stop words of the Punjabi language is released for public usage and further NLP applications.

Information Retrieval for Gujarati Language Using Cosine Similarity Based Vector Space Model

This is the first IR task in the Gujarati language using cosine-similarity-based calculations with VSDM, which is widely used in information retrieval and document classification, where each document is represented as a vector and each dimension corresponds to a separate term.
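The vector representation described in this summary can be sketched with raw term counts as the vector components. This is a minimal illustration under assumptions: the paper's actual term weighting is not stated in the summary, and the function name `cosine_similarity` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors.

    Each distinct term is one dimension; the component is the raw count
    of that term in the document (an assumed weighting for this sketch).
    """
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


query = ["a", "b"]
document = ["a", "c"]
print(cosine_similarity(query, document))
```

Removing stop words before building the vectors keeps very frequent function words from dominating the dot product.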

Avyaya Analyzer: Analysis of Indeclinables using Finite State Transducers

A typical language parser can be divided into 3 components, viz., Morphological Analyzer, Local Word Grouper and Core Parser, which generate the parsed structure of the sentence.

Preprocessing Techniques for Text Mining - An Overview

This paper discusses text mining and its preprocessing techniques; text mining extracts information from both structured and unstructured data and finds patterns.