Stop-Word Removal Algorithm and its Implementation for Sanskrit Language

@article{Raulji2016StopWordRA,
  title={Stop-Word Removal Algorithm and its Implementation for Sanskrit Language},
  author={Jaideepsinh K. Raulji and Jatinderkumar R. Saini},
  journal={International Journal of Computer Applications},
  year={2016},
  volume={150},
  pages={15-17}
}
In the information era, optimizing processes for Information Retrieval, Text Summarization, and Text and Data Analytics systems is of utmost importance. To achieve accuracy, redundant words with little or no semantic meaning must be filtered out. Such words are known as stopwords. Stopword lists have been developed for languages such as English, Chinese, Arabic, and Hindi. A stopword list is also available for the Sanskrit language. Stop-word removal is an important…
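The filtering step described in the abstract can be sketched as follows. This is a minimal illustration in Python, not the algorithm from the paper: the stop list `SAMPLE_STOPWORDS`, the punctuation set `PUNCTUATION`, and the function name `remove_stopwords` are all assumptions made for the example, and tokenization is plain whitespace splitting.

```python
# Illustrative stop list of a few common Sanskrit particles.
# Assumption: this is NOT the stop-word list used in the paper.
SAMPLE_STOPWORDS = frozenset({"च", "एव", "तु", "हि", "वा", "अपि"})

# Punctuation stripped from token edges, including the danda and double danda.
PUNCTUATION = "।॥,.;:!?"


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Return the whitespace-separated tokens of `text` that are not stopwords."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.split())
    # Drop tokens that were pure punctuation (now empty) and tokens in the stop list.
    return [tok for tok in tokens if tok and tok not in stopwords]


if __name__ == "__main__":
    sentence = "रामः च लक्ष्मणः च वनं गच्छतः ।"
    print(remove_stopwords(sentence))
```

A set lookup keeps each membership test constant-time on average, so the whole pass is linear in the number of tokens. Real Sanskrit text would likely need sandhi splitting before a list-based filter like this can match individual words.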
Dynamic Stopword Removal for Sinhala Language
TLDR
The focus is to show that the cut-off point depends on the source data and the machine learning algorithm, which is demonstrated using Newton's iteration method for root finding.
Stopword Identification and Removal Techniques on TC and IR applications: A Survey
  • Dhara J. Ladani, N. Desai
  • Computer Science
    2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)
  • 2020
TLDR
This paper discusses the major stopword identification techniques used by researchers over the last few decades for Indian and non-Indian languages, and presents a survey of stopword list generation methods and their characteristics.
Automatic Generation of Stopwords in the Amharic Text
TLDR
This paper proposes automatic identification of stopwords in Amharic text using an aggregate-based methodology that combines word frequency, inverse document frequency, and an entropy measure.
Generating Javanese Stopwords List using K-means Clustering Algorithm
TLDR
This work has shown that a stopword list for low resources language such as Javanese is not available yet, and suggested that such a list should be developed in the near future.
Review on Natural Language Processing Trends and Techniques Using NLTK
TLDR
Common language processing procedures include tokenization, stop word removal, stemming, lemmatization, part-of-speech tagging, chunking, and named entity recognition, which enhance the performance of NLP applications, and the Natural Language Toolkit is presented as a well-suited tool for learning the ropes of the NLP domain.
Functional words removal techniques: A review
  • S. Gandotra, B. Arora
  • Computer Science
    2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC)
  • 2018
TLDR
The stop-word removal techniques used for Indian text are discussed, and an analysis of the results those techniques produce for various Indian languages is presented.
Accelerating Text Mining Using Domain-Specific Stop Word Lists
TLDR
A novel mathematical approach for the automatic extraction of domain-specific words, called the hyperplane-based approach, is proposed; it relies on a low-dimensional representation of each word in vector space and its distance from a hyperplane, and significantly reduces text dimensionality by eliminating irrelevant features.
Comparison of text preprocessing methods
TLDR
The pros and cons of several common text preprocessing methods are discussed: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions.
An Experimental Study of Text Preprocessing Techniques for Automatic Short Answer Grading in Indonesian
TLDR
This experimental study measures the effectiveness of preprocessing techniques in Automatic Short Answer Grading (ASAG) using questions and answers in Indonesian, and computes the similarity between teacher and student answers using the cosine similarity method.
Effect of stopwords in Indian language IR
TLDR
It is observed that stopword removal generally improves mean average precision (MAP) significantly compared with the case when it is not done, and the authors recommend, based on experiments, a number of stopwords for the chosen Indian languages that is good enough from a retrieval point of view.

References

Stop-word removal algorithm for Arabic language
TLDR
The new Arabic stop-word removal technique has been tested using a set of 242 Arabic abstracts chosen from the Proceedings of the Saudi Arabian National Computer conferences, and another data set chosen from the Holy Qur'an, and it gives impressive results, reaching approximately 98%.
A Rule-Based Approach to Identify Stop Words for Gujarati Language
TLDR
For the first time in the scientific community worldwide, a dynamic approach independent of all factors, namely usage of file or dictionary, word length, word frequency, and training dataset, is presented, focusing on automatic and dynamic identification of a complete list of Gujarati stop words.
Effects of Stop Words Elimination for Arabic Information Retrieval: A Comparative Study
TLDR
The overall performance of a general stoplist was better than the other two lists, and stoplists improved retrieval effectiveness especially when used with the BM25 weight.
POS Word Class Based Categorization of Gurmukhi Language Stemmed Stop Words
TLDR
This paper concentrates on providing a better and deeper understanding of Punjabi stop words in light of Punjabi grammar and part-of-speech-based word class categorization.
Punjabi Stop Words: A Gurmukhi, Shahmukhi and Roman Scripted Chronicle
TLDR
For the first time in the scientific community dealing with computational linguistics and literature processing using NLP techniques, a list of 184 stop words of the Punjabi language is released for public usage and further NLP applications.
Information Retrieval for Gujarati Language Using Cosine Similarity Based Vector Space Model
TLDR
This is the first IR task in the Gujarati language using cosine-similarity-based calculations with VSDM, which is widely used in information retrieval and document classification, where each document is represented as a vector and each dimension corresponds to a separate term.
Avyaya Analyzer: Analysis of Indeclinables using Finite State Transducers
TLDR
A typical language parser can be divided into 3 components, viz., Morphological Analyzer, Local Word Grouper and Core Parser, which generate the parsed structure of the sentence.
Preprocessing Techniques for Text Mining - An Overview
TLDR
This paper discusses text mining, a technique that extracts information from both structured and unstructured data and finds patterns, and its preprocessing techniques.
Hybrid Stop-Word Removal Technique for Arabic Language