On Continent and Script-Wise Divisions-Based Statistical Measures for Stop-words Lists of International Languages
@article{Saini2016OnCA,
  title={On Continent and Script-Wise Divisions-Based Statistical Measures for Stop-words Lists of International Languages},
  author={Jatinderkumar R. Saini and Rajnish M. Rakholia},
  journal={Procedia Computer Science},
  year={2016},
  volume={89},
  pages={313--319}
}
20 Citations
Generating Stopword List for Sanskrit Language
- Computer Science
- 2017 IEEE 7th International Advance Computing Conference (IACC)
- 2017
The paper presents a first-of-its-kind list of seventy-five generic stopwords for the Sanskrit language, extracted from data amounting to nearly seventy-six thousand words using a hybrid approach.
Automatic Generation of Stopwords in the Amharic Text
- Computer Science
- 2018
This paper proposes the automatic identification of stopwords in Amharic text using an aggregate methodology based on word frequency, inverse document frequency, and an entropy value measure.
Stop-Word Removal Algorithm and its Implementation for Sanskrit Language
- Computer Science
- 2016
A simple approach is used to design and implement a stop-word removal algorithm for the Sanskrit language; both the algorithm and its implementation use a dictionary-based approach.
Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List
- Linguistics
- International Journal of Advanced Computer Science and Applications
- 2020
This research emphasizes the use of stop lemmas instead of stop words, because stop word lists contain various, but not all, morphological forms of a word, whereas a stop lemma list contains only the root form of the word, from which variations can be derived if required.
DBTechVoc: A POS-tagged Vocabulary of Tokens and Lemmata of the Database Technical Domain
- Computer Science
- International Journal of Advanced Computer Science and Applications
- 2022
The empirical results, based on more than 1000 high-quality research papers collected over a period of 45 years from 1976 to 2021, show that the general technical word list for the domain of computer science differs from the specific technical word list for the domain of databases.
Measuring the Similarity between the Sanskrit Documents using the Context of the Corpus
- Computer Science
- 2020
The proposed approach processes Sanskrit, one of the oldest, untouched, and morphologically critical languages, and builds a Document Term Matrix for Sanskrit (DTMS) and a Document Synset Matrix for Sanskrit (DSMS) to address the problem of polysemy.
Marathi Document: Similarity Measurement using Semantics-based Dimension Reduction Technique
- Computer Science
- 2020
The proposed approach designs the Document Term Matrix for Marathi (DTMM) corpus, converting unstructured data into a tabular format, then forms synsets and reduces dimensions to formulate a Document Synset Matrix for the Marathi corpus.
Kāvi: An Annotated Corpus of Punjabi Poetry with Emotion Detection Based on ‘Navrasa’
- Computer Science
- 2020
Topic Modeling and Classification of Cyberspace Papers Using Text Mining
- Computer Science
- 2018
This study uses text mining algorithms to extract, validate, and analyze 1860 scientific articles on the cyberspace domain and provides insight into the future scientific directions of cyberspace studies.
References
SHOWING 1-10 OF 30 REFERENCES
POS Word Class Based Categorization of Gurmukhi Language Stemmed Stop Words
- Linguistics
- 2016
This paper concentrates on providing a better and deeper understanding of Punjabi stop words in light of Punjabi grammar and part-of-speech-based word class categorization.
Automatic identification of light stop words for Persian information retrieval systems
- Computer Science
- J. Inf. Sci.
- 2014
This paper proposes an automatic aggregated methodology based on term frequency, normalized inverse document frequency and information model to extract the light stop words from Persian text to reduce the number of index terms.
Automatic construction of Chinese stop word list
- Computer Science
- 2006
This paper proposes an automatic aggregated methodology based on statistical and information models for extraction of a stop word list in Chinese language, and shows that the list is much more general than other Chinese stop lists as well.
The design and implementation of diacritic extraction technique for Gujarati written script using Unicode Transformation Format
- Computer Science
- 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT)
- 2015
The paper presents the design and implementation of diacritic extraction for the Gujarati script using the Unicode Transformation Format (UTF); the results are encouraging enough to make the proposed implementation viable for NLP tasks in the Gujarati language.
Effective Listings of Function Stop words for Twitter
- Computer Science
- ArXiv
- 2012
This paper examines the original work using term frequency, inverse document frequency and term adjacency for developing a stop words list for the Twitter data source, and proposes a new technique using combinatorial values as an alternative measure to effectively list stop words.
Toward an ARABIC Stop-Words List Generation
- Computer Science
- 2012
A statistical approach is presented to extract an Arabic stop-words list, and the results show an improvement in an ANN-based classifier using the generated stop-words list over the general list.
A Study and Analysis of Opinion Mining Research in Indo-Aryan, Dravidian and Tibeto-Burman Language Families
- Linguistics, Computer Science
- 2014
A study and analysis of different languages used for emotion detection and sentiment analysis of formal and informal pieces of writing in India, with a performance comparison of Indian languages against a world language, namely English.
A Study of Text Classification Natural Language Processing Algorithms for Indian Languages
- Computer Science
- 2015
This study shows that supervised learning algorithms (Naive Bayes (NB), Support Vector Machine (SVM), Artificial Neural Network (ANN), and N-gram) performed better on the text classification task.
Structural Analysis of Username Segment in E-Mail Addresses of MCA Institutes of Gujarat State
- Computer Science
- 2010
It was found that institutions tend to design the username segment of their e-mail addresses by choosing words or combinations of words from specific categories, including special characters, digits and random words.
A Textual Analysis of Digits Used for Designing Yahoo-Group Identifiers
- Computer Science, Mathematics
- 2010
A tremendous increase in the use of the Internet for online communication is witnessed worldwide. Yahoo! Inc., provides one such service in the form of Yahoo! Groups. Each such group is identified…