Accelerating Text Mining Using Domain-Specific Stop Word Lists

Farah Alshanik, Amy W. Apon, Alexander Herzog, Ilya Safro, Justin Sybrandt
2020 IEEE International Conference on Big Data (Big Data)

Text preprocessing is an essential step in text mining. Removing words that are uninformative or that can degrade the quality of prediction algorithms saves storage in text indexing and improves computational efficiency. Typically, a generic stop word list is applied to a dataset regardless of the domain. However, the common words differ from one domain to another, and within a given domain they carry no significance. Eliminating domain…

Proactive Query Expansion for Streaming Data Using External Source

Results indicate that adding words from the secondary stream can significantly improve the quality of search queries and return more relevant information that covers a certain topic.

Domain-specific Stop Words in Malaysian Parliamentary Debates 1959 – 2018

Removal of stop words is essential in Natural Language Processing and text-related analysis. Existing works on Malay stop words are based on standard Malay and Quranic/Arabic translations into Malay.

List of Papers

Notation for Publications ML1 – Machine Learning/Data Mining, GA – Graph Algorithms/Network Science, NLP – Natural Language Processing/Text Mining, MS – Multiscale Methods, QC – Quantum Computing,



Stop-Word Removal Algorithm and its Implementation for Sanskrit Language

A simple approach is used to design a stop-word removal algorithm and its implementation for the Sanskrit language; both the algorithm and the implementation use a dictionary-based approach.
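
A dictionary-based removal step of the kind this summary describes can be sketched as a set lookup. This is only an illustration, not the paper's implementation: the function name `remove_stop_words` and the English placeholder list are assumptions, and a real system would load a curated Sanskrit list instead.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are absent from the stop-word dictionary."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical placeholder dictionary (English words, for illustration only).
STOP_WORDS = frozenset({"the", "is", "of", "and"})

tokens = "the removal of stop words is simple".split()
filtered = remove_stop_words(tokens, STOP_WORDS)
print(filtered)
```

A `frozenset` gives constant-time membership checks, so the cost of the pass grows with the number of tokens rather than with the size of the dictionary.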

Automatic Identification of Stop Words in Chinese Text Classification

  • Lili Hao, Lizhu Hao
  • Computer Science
    2008 International Conference on Computer Science and Software Engineering
  • 2008
This paper gives a refined definition of stop words in Chinese text classification from the perspective of statistical correlation. It then proposes an automatic approach to extracting the stop word list, based on the weighted Chi-squared statistic on a 2*p contingency table, and evaluates the resulting lists by the accuracies obtained in text classification experiments on a real-world Chinese corpus.
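
To make the 2*p contingency table concrete, the sketch below computes the plain (unweighted) Pearson Chi-squared statistic for one word across p categories. The weighting scheme used in the paper is not given here, so it is deliberately left out; the function name and the sample counts are assumptions for illustration. A word whose statistic is close to zero is distributed independently of the categories, which makes it a plausible stop word candidate.

```python
def chi_squared_2xp(present, absent):
    """Pearson chi-squared for a 2 x p table.

    present[j]: documents in category j that contain the word
    absent[j]:  documents in category j that do not contain it
    """
    rows = [present, absent]
    row_totals = [sum(row) for row in rows]
    col_totals = [a + b for a, b in zip(present, absent)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts for three categories of 100 documents each.
evenly_spread = chi_squared_2xp([50, 50, 50], [50, 50, 50])
category_bound = chi_squared_2xp([90, 10, 5], [10, 90, 95])
print(evenly_spread, category_bound)
```

In this made-up data the evenly spread word scores 0 and the category-bound word scores about 200, so only the first would be considered for the stop word list.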

HSRA: Hindi stopword removal algorithm

This paper proposes a stopword removal algorithm for the Hindi language based on a deterministic finite automaton (DFA). The algorithm was tested on 200 documents, achieved 99% accuracy, and is also time efficient.

Towards modernised and Web-specific stoplists for Web document analysis

  • M. P. Sinka, D. Corne
  • Computer Science
    Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003)
  • 2003
Two new word-entropy based stoplists are developed: one derived from random Web pages and one derived from the BankSearch dataset. The results show that existing stoplists perform well but are sometimes outperformed by the new stoplists, especially on hard classification tasks.
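
One common way to score a word by entropy is to measure how evenly its occurrences are spread over the documents. The sketch below uses that definition as an assumption; the exact entropy formula in the paper may differ, and the function name and toy documents are hypothetical. Under this definition, a higher value means the word appears everywhere in similar amounts and is a stronger stoplist candidate.

```python
import math


def word_entropy(word, documents):
    """Shannon entropy (base 2) of a word's occurrences across documents.

    documents: list of token lists.
    """
    counts = [doc.count(word) for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical toy corpus of four tokenized documents.
docs = [
    ["the", "web", "page"],
    ["the", "bank", "loan"],
    ["the", "bank", "rate"],
    ["the", "sport", "score"],
]
scores = {w: word_entropy(w, docs) for w in ("the", "bank", "web")}
print(scores)
```

Ranking the vocabulary by this score and keeping the top entries gives a candidate stoplist; the cutoff is a tuning choice that this sketch does not address.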

Automatic Construction of Generic Stop Words List for Hindi Text

Automatic Generation of Stopwords in the Amharic Text

This paper proposes the automatic identification of stopwords in Amharic text using an aggregate methodology that combines word frequency, inverse document frequency, and an entropy measure.

Toward an ARABIC Stop-Words List Generation

A statistical approach is presented to extract an Arabic stop-word list; with the generated list, an ANN-based classifier improves over its results with the general list.

A Rule-Based Approach to Identify Stop Words for Gujarati Language

For the first time in the scientific community worldwide, a dynamic approach is presented that is independent of all factors, namely the use of a file or dictionary, word length, word frequency, and a training dataset. It focuses on the automatic and dynamic identification of a complete list of Gujarati stop words.

Automatically Building a Stopword List for an Information Retrieval System

This paper presents different methods for deriving a stopword list automatically for a given collection and evaluates them on four standard TREC collections. It also introduces a new approach, called term-based random sampling, which uses the Kullback-Leibler divergence measure to derive a stopword list automatically.
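
The role of the Kullback-Leibler divergence can be illustrated with a per-term weight that compares a term's probability in a sampled set of documents with its probability in the whole collection. The sketch below is an assumption about the general shape of such a weight, not the paper's exact procedure: the sampling step is omitted, and the function name and toy text are hypothetical. Terms with the lowest weight look the same in the sample as in the collection and are therefore the least informative.

```python
import math
from collections import Counter


def kl_term_weights(sample_tokens, collection_tokens):
    """Per-term KL contribution: P_sample * log2(P_sample / P_collection).

    Assumes every sampled token also occurs in the collection.
    """
    sample_counts = Counter(sample_tokens)
    collection_counts = Counter(collection_tokens)
    sample_len = len(sample_tokens)
    collection_len = len(collection_tokens)
    weights = {}
    for term, tf in sample_counts.items():
        p_sample = tf / sample_len
        p_collection = collection_counts[term] / collection_len
        weights[term] = p_sample * math.log2(p_sample / p_collection)
    return weights


# Hypothetical toy collection; the sample is its last five tokens.
collection = "the cat sat on the mat the dog ate the bone".split()
sample = collection[-5:]
weights = kl_term_weights(sample, collection)
least_informative = min(weights, key=weights.get)
print(least_informative, weights)
```

Repeating this over many samples and merging the lowest-weighted terms is one way such weights could feed a stopword list.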

Automatic construction of Chinese stop word list

This paper proposes an automatic aggregated methodology, based on statistical and information models, for extracting a stop word list in the Chinese language, and shows that the resulting list is much more general than other Chinese stop lists.