On Continent and Script-Wise Divisions-Based Statistical Measures for Stop-words Lists of International Languages

Jatinderkumar R. Saini and Rajnish M. Rakholia. Procedia Computer Science.

Generating Stopword List for Sanskrit Language
The paper presents a first-of-its-kind list of seventy-five generic stopwords for the Sanskrit language, extracted from data amounting to nearly seventy-six thousand words using a hybrid approach.
Dataset of stopwords extracted from Uzbek texts
Automatic Generation of Stopwords in the Amharic Text
This paper proposes the automatic identification of stopwords in Amharic text using an aggregate methodology based on word frequency, inverse document frequency, and an entropy measure.
Stop-Word Removal Algorithm and its Implementation for Sanskrit Language
A simple dictionary-based approach is used to design and implement a stop-word removal algorithm for the Sanskrit language.
Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List
This research emphasizes the use of stop lemmas instead of stop words, because stop word lists contain various, but not all, morphological forms of a word, whereas a stop lemma list contains only the root form, from which variations can be derived if required.
DBTechVoc: A POS-tagged Vocabulary of Tokens and Lemmata of the Database Technical Domain
The empirical results, based on more than 1000 high-quality research papers collected over a period of 45 years from 1976 to 2021, prove that the general technical word list of the computer science domain differs from the specific technical word list of the database domain.
Measuring the Similarity between the Sanskrit Documents using the Context of the Corpus
The proposed approach processes Sanskrit, one of the oldest, largely untouched, and morphologically critical languages, and builds a Document Term Matrix for Sanskrit (DTMS) and a Document Synset Matrix for Sanskrit (DSMS) to solve the problem of polysemy.
Marathi Document: Similarity Measurement using Semantics-based Dimension Reduction Technique
The proposed approach designs a Document Term Matrix for a Marathi corpus (DTMM), converting unstructured data into a tabular format, and then forms synsets to reduce dimensions and formulate a Document Synset Matrix for the Marathi corpus.
Topic Modeling and Classification of Cyberspace Papers Using Text Mining
This study uses text mining algorithms to extract, validate, and analyze 1860 scientific articles in the cyberspace domain and provides insight into the future scientific directions of cyberspace studies.


POS Word Class Based Categorization of Gurmukhi Language Stemmed Stop Words
This paper concentrates on providing a better and deeper understanding of Punjabi stop words in light of Punjabi grammar and part-of-speech-based word class categorization.
Automatic identification of light stop words for Persian information retrieval systems
This paper proposes an automatic aggregated methodology based on term frequency, normalized inverse document frequency and information model to extract the light stop words from Persian text to reduce the number of index terms.
Automatic construction of Chinese stop word list
This paper proposes an automatic aggregated methodology based on statistical and information models for extraction of a stop word list in Chinese language, and shows that the list is much more general than other Chinese stop lists as well.
The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format
The paper presents the design and implementation of diacritic extraction for the Gujarati script using the Unicode Transformation Format (UTF); the results are encouraging enough to make the proposed implementation viable for NLP tasks in the Gujarati language.
Effective Listings of Function Stop words for Twitter
This paper examines the original work using term frequency, inverse document frequency and term adjacency for developing a stop words list for the Twitter data source, and proposes a new technique using combinatorial values as an alternative measure to list stop words effectively.
Toward an ARABIC Stop-Words List Generation
A statistical approach is presented to extract an Arabic stop-words list; the results show an improvement in an ANN-based classifier using the generated stop-words list over the general list.
A Study and Analysis of Opinion Mining Research in Indo-Aryan, Dravidian and Tibeto-Burman Language Families
A study and analysis of the different languages used for emotion detection and sentiment analysis of formal and informal writing in India, with a performance comparison of Indian languages against a world language, English.
A Study of Text Classification Natural Language Processing Algorithms for Indian Languages
This study shows that supervised learning algorithms (Naive Bayes (NB), Support Vector Machine (SVM), Artificial Neural Network (ANN), and N-gram) performed better on the text classification task.
Structural Analysis of Username Segment in E-Mail Addresses of MCA Institutes of Gujarat State
It was found that institutions tend to design the username segment of their e-mail addresses by choosing words or combinations of words from specific categories, including special characters, digits and random words.
A Textual Analysis of Digits Used for Designing Yahoo-Group Identifiers
A tremendous increase in the use of the Internet for online communication is witnessed worldwide. Yahoo! Inc. provides one such service in the form of Yahoo! Groups. Each such group is identified