Generating Javanese Stopwords List using K-means Clustering Algorithm

@article{Wibawa2020GeneratingJS,
  title={Generating Javanese Stopwords List using K-means Clustering Algorithm},
  author={Aji Prasetya Wibawa and Hidayah Kariima Fithri and Ilham Ari Elbaith Zaeni and Andrew Nafalski},
  journal={Knowl. Eng. Data Sci.},
  year={2020},
  volume={3},
  pages={106--111}
}
Text processing in Information Retrieval (IR) requires text documents as primary data sources. However, not all words in a text document are used. Words that appear often in text documents but carry no meaning are called stopwords [1]; they are stored in a stopword list, also called a stopword database (corpus) [2][3]. The stopword removal approach depends on this corpus to remove unnecessary words from the text [4]. The word list must be in the same language as the text [1][5]. Various stopword lists have been…
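
The list-based removal step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `remove_stopwords`, the regex tokenizer, and the four Javanese function words in `STOPWORDS` are assumptions chosen for the example and are not taken from the paper's generated list.

```python
import re

# Hypothetical stopword list: these Javanese function words are illustrative
# placeholders, not entries from the paper's generated list.
STOPWORDS = frozenset({"lan", "ing", "sing", "iku"})


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase and tokenize the text, then drop tokens found in the stopword list."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


filtered = remove_stopwords("Bapak lan ibu lunga ing pasar")
print(filtered)
```

Because the lookup is a plain set membership test, the list must be built for the same language as the input text, which is the dependency the abstract points out.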

K-Medoids Clustering untuk Pembentukan Database Stopword Bahasa Jawa

The results of this study suggest that the stopword list produced by k-medoids clustering with K=13 has an accuracy of 70.5%.

Automatic Knowledge Integration Method of English Translation Corpus Based on Kmeans Algorithm

The experimental results prove that the text features extracted by the sparse autoencoder based on the Kmeans algorithm can be used for English translation corpus knowledge clustering to achieve automatic integration.

Convolutional Neural Network (CNN) to determine the character of wayang kulit

Wayang Kulit is a traditional Indonesian art genre that has been designated as a "Masterpiece of the Oral and Intangible Heritage of Humanity" by UNESCO [1], [2]. Wayang Kulit has a variety of names and

References

Stop-Word Removal Algorithm and its Implementation for Sanskrit Language

A simple approach is used to design a stop-word removal algorithm and its implementation for the Sanskrit language; both the algorithm and the implementation use a dictionary-based approach.

Context aware stopwords for Sinhala Text classification

Seven stopword identification methods previously applied to other languages are presented for removing stopwords, and a new algorithm for building a domain-specific stopword list is proposed.

Stopwords Removal and Its Algorithms Based on Different Methods

  • J. Kaur
  • Computer Science
    International Journal of Advanced Research in Computer Science
  • 2018
The main goal of this thesis is to remove stopwords in the Punjabi language using different techniques; eliminating these stopwords reduces the document size by 30-35%.

Automatic construction of Chinese stop word list

This paper proposes an automatic aggregated methodology, based on statistical and information models, for extracting a stop word list in the Chinese language, and shows that the resulting list is much more general than other Chinese stop lists.

A Rule-Based Approach to Identify Stop Words for Gujarati Language

For the first time in the scientific community worldwide, a dynamic approach is presented that is independent of all factors, namely the use of a file or dictionary, word length, word frequency, and training dataset, and that focuses on automatic and dynamic identification of a complete list of Gujarati stop words.

Automatically Building a Stopword List for an Information Retrieval System

This paper presents different methods for deriving a stopword list automatically for a given collection and evaluates the results using four different standard TREC collections. It also introduces a new approach, called term-based random sampling, based on the Kullback-Leibler divergence measure, which derives a stopword list automatically.

Analysis of TF-IDF Model and its Variant for Document Retrieval

The study analyzes and evaluates the retrieval effectiveness of the vector-space model on the new FIRE 2011 data set, and the results show that the TF-IDF model gives the highest precision values with the new corpus dataset.

When stopword lists make the difference

It is shown that through implementing the original Okapi form or certain ones derived from the Divergence from Randomness (DFR) paradigm, significantly lower performance levels may result when using short or no stopword lists.

Preprocessing Techniques for Text Mining - An Overview

This paper discusses text mining, a technique that extracts information from both structured and unstructured data and finds patterns, along with its preprocessing techniques.

Arabic Sentiment Analysis Using Supervised Classification

  • R. Duwairi, Islam Qarqaz
  • Computer Science
    2014 International Conference on Future Internet of Things and Cloud
  • 2014
Three classifiers were applied to an in-house dataset of tweets and comments, and the results show that SVM gives the highest precision while KNN (K=10) gives the highest recall.