An Application of Zipf's Law for Prose and Verse Corpora Neutrality for Hindi and Marathi Languages

  • Prafulla Bharat Bafna, Jatinderkumar R.
  • Published 2020
  • Computer Science
  • International Journal of Advanced Computer Science and Applications
Text has become available in many languages, as almost all websites now offer a multilingual option. Hindi is considered an official language in one of the states of India. Hindi text analysis is dominated by corpora of stories and poems. Token extraction is an important step before any text analysis and supports many applications, such as text summarization, text categorization and so on. Token extraction is a part of Natural Language Processing (NLP). NLP…
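As an illustration of the token extraction step and of the Zipf's law check named in the title, the following is a minimal sketch and not the authors' method. It assumes whitespace-and-punctuation tokenization (including the Devanagari danda "।" and double danda "॥") and estimates the Zipf exponent as the least-squares slope of log(frequency) against log(rank). The function names `tokenize`, `rank_frequency` and `zipf_slope` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: tokens are separated by whitespace, common punctuation,
# or the Devanagari danda marks. Splitting on separators (rather than
# matching \w+) keeps vowel signs attached to their consonants.
SEPARATORS = re.compile(r"[\s।॥,.;:!?\"'()\-]+")

def tokenize(text):
    """Return the non-empty tokens of a Hindi or Marathi text."""
    return [tok for tok in SEPARATORS.split(text) if tok]

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    counts = Counter(tokens)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]

def zipf_slope(pairs):
    """Least-squares slope of log(frequency) versus log(rank).

    Needs at least two distinct ranks. A slope near -1 is what
    Zipf's law predicts for natural-language text.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

if __name__ == "__main__":
    sample = "राम घर गया। राम आया। राम ने घर देखा।"
    pairs = rank_frequency(tokenize(sample))
    print(pairs[:3], zipf_slope(pairs))
```

Running the same functions on a prose corpus and on a verse corpus and comparing the two slopes is one simple way to examine whether both follow a similar rank-frequency pattern.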

Towards Natural Language Processing with Figures of Speech in Hindi Poetry
This work is the first of its kind in Hindi Natural Language Processing (NLP) to address Hindi figures of speech. It creates a systematic hierarchical structure of Hindi “Alankaar” types and sub-types and extends the work to identify a few of them.
Measuring the Similarity between the Sanskrit Documents using the Context of the Corpus
The proposed approach processes Sanskrit, one of the oldest, largely untouched and morphologically critical languages, and builds a Document Term Matrix for Sanskrit (DTMS) and a Document Synset Matrix for Sanskrit (DSMS) to solve the problem of polysemy.
Stanza Type Identification using Systematization of Versification System of Hindi Poetry
The paper covers various challenges and the best possible solutions for those challenges, describing the methodology to generate automatic metadata for “Chhand” based on the poems’ stanzas, and provides some advanced information and techniques for metadata generation for “Muktak Chhands”.
Marathi Document: Similarity Measurement using Semantics-based Dimension Reduction Technique
The proposed approach designs a Document Term Matrix for Marathi (DTMM) corpus, converting unstructured data into a tabular format, and then forms synsets to reduce dimensions and formulate a Document Synset Matrix for the Marathi corpus.
Hindi Poetry Classification using Eager Supervised Machine Learning Algorithms
Two eager machine learning algorithms are applied to a corpus of 450 Hindi poems, and each poem is classified based on the terms present in it, with misclassification error as the measure.
The study presents a novel perspective on sentiment capture for Gujarati poems, using a variety of characteristics within the poems to disclose the emotions they express.
Toward a least-effort principle for evaluating prices of elements as indicators of sustainability
In this article, we use rank to understand the price of chemical elements. We observe that the role of the volume from global mining production dominates in materials economics. In this article, we


Hindi Multi-document Word Cloud based Summarization through Unsupervised Learning
  • P. Bafna, Jatinderkumar R. Saini
  • Computer Science
    2019 9th International Conference on Emerging Trends in Engineering and Technology - Signal and Information Processing (ICETET-SIP-19)
  • 2019
The objective is to manage documents and summarize a Hindi corpus by applying token extraction and document clustering, using TF-IDF, cosine-based document similarity measures and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities.
Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List
This research emphasizes the use of stop lemmas instead of stop words: stop word lists contain various, but not all, morphological forms of a word, whereas a stop lemma list contains only the root form, from which variations can be derived if required.
Context Specific Lexicon for Hindi Reviews
Marathi Text Analysis using Unsupervised Learning and Word Cloud
  • Computer Science
    International Journal of Engineering and Advanced Technology
  • 2020
Results demonstrate the robustness of the proposed approach for a Marathi corpus, which applies TF-IDF, cosine-based document similarity measures and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities.
On Exhaustive Evaluation of Eager Machine Learning Algorithms for Classification of Hindi Verses
Text classification algorithms, along with Natural Language Processing (NLP), facilitate a fast, cost-effective and scalable solution for the classification and prediction of verses in a Hindi corpus.
Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe
We present an update to UDPipe 1.0 (Straka et al., 2016), a trainable pipeline which performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. We provide
Predicting Sensitivity of Local News Articles from Odia Dailies
Local news is categorized as positive, negative or neutral, and the sensitivity of negative local news articles is predicted in order to set the priority of action to be taken by the local administration.
Document clustering: TF-IDF approach
The Term Frequency-Inverse Document Frequency algorithm is used along with fuzzy K-means and hierarchical algorithms to form clusters of related documents; the resulting silhouette coefficient, entropy and F-measure trends are presented to show algorithm behavior for each data set.
On Readability Metrics of Goal Statements of Universities and Brand-Promoting Lexicons for Industries
The correlation between the found lexicons and the revenues generated by the considered companies is advocated and Pearson's correlation coefficient and Flesch Readability Index are deployed for the calculation of various metrics to form the basis of the conclusions.
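
Several of the entries above rely on TF-IDF weighting and cosine-based document similarity. The following is a minimal sketch of that combination, not the implementation used in any of the listed papers. It assumes documents are already tokenized, uses relative term frequency and the unsmoothed `log(N / df)` inverse document frequency (other variants exist), and the function names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per tokenized document.

    Assumption: tf is the term's relative frequency in the document and
    idf is log(N / df), where df is the number of documents containing
    the term. A term present in every document gets weight 0.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

if __name__ == "__main__":
    docs = [["नदी", "पानी"], ["नदी", "पानी"], ["पर्वत", "हवा"]]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The pairwise similarities produced this way could then be passed to a hierarchical clustering routine to draw the kind of dendrogram the entries mention.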