An Application of Zipf's Law for Prose and Verse Corpora Neutrality for Hindi and Marathi Languages

  • Prafulla Bharat Bafna, Jatinderkumar R. Saini
  • Published 2020
  • Computer Science
  • International Journal of Advanced Computer Science and Applications
Text in many languages has become widely available, as almost all websites now offer a multilingual option. Hindi is considered an official language in one of the states of India. Hindi text analysis is dominated by corpora of stories and poems. Token extraction is an important step before any text analysis and supports many applications, such as text summarization and text categorization. Token extraction is a part of Natural Language Processing (NLP). NLP… 
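
As a rough illustration of the token extraction step mentioned in the abstract and of the rank-frequency relation that Zipf's law describes (a word's frequency is roughly inversely proportional to its frequency rank), here is a minimal Python sketch. It is not the authors' pipeline: the whitespace tokenizer, the punctuation set, and the function names `tokenize`, `rank_frequency` and `zipf_slope` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumption: a small punctuation set, including the Devanagari danda marks.
PUNCT = ".,;:!?\"'()[]{}।॥-"


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (raw.strip(PUNCT) for raw in text.split())
    return [t for t in tokens if t]


def rank_frequency(tokens):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def zipf_slope(table):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what the basic form of Zipf's law predicts.
    Needs at least two distinct words in the table.
    """
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(freq) for _, _, freq in table]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    sample = "राम घर गया। राम आया। राम घर आया।"
    table = rank_frequency(tokenize(sample))
    for rank, word, freq in table:
        print(rank, word, freq)
    print("slope:", zipf_slope(table))
```

Splitting on whitespace rather than on a `\w+` pattern is a deliberate choice here: Python's `\w` does not match Devanagari vowel signs, so a regex tokenizer of that kind would break Hindi and Marathi words apart. A real corpus would need far more text than the toy sample before the fitted slope says anything meaningful.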

Towards Natural Language Processing with Figures of Speech in Hindi Poetry
This work is the first of its kind in Hindi Natural Language Processing (NLP) to address Hindi figures of speech. It creates a systematic hierarchical structure of Hindi “Alankaar” types and sub-types and extends the work to identify a few of them.
Measuring the Similarity between the Sanskrit Documents using the Context of the Corpus
The proposed approach processes Sanskrit, one of the oldest, largely untouched, and morphologically critical languages, and builds a Document Term Matrix for Sanskrit (DTMS) and a Document Synset Matrix for Sanskrit (DSMS) to solve the problem of polysemy.
Stanza Type Identification using Systematization of Versification System of Hindi Poetry
The paper covers various challenges and the best possible solutions for those challenges, describing the methodology to generate automatic metadata for “Chhand” based on the poems’ stanzas, and provides some advanced information and techniques for metadata generation for “Muktak Chhands”.
Marathi Document: Similarity Measurement using Semantics-based Dimension Reduction Technique
The proposed approach designs a Document Term Matrix for a Marathi corpus (DTMM), converting unstructured data into a tabular format. It then forms synsets, which in turn reduce the dimensions, to formulate a Document Synset Matrix for the Marathi corpus.
Hindi Poetry Classification using Eager Supervised Machine Learning Algorithms
Two eager machine learning algorithms are applied to a corpus of 450 Hindi poems, and each poem is classified based on the terms present in it, with results assessed using misclassification error.
The study presents a novel perspective on capturing sentiment in Gujarati poems, using a variety of characteristics found within the poems to disclose the emotions they express.
Toward a least-effort principle for evaluating prices of elements as indicators of sustainability
In this article, we use rank to understand the price of chemical elements. We observe that the role of the volume from global mining production dominates in materials economics. In this article, we


Hindi Multi-document Word Cloud based Summarization through Unsupervised Learning
  • P. Bafna, Jatinderkumar R. Saini
  • Computer Science
  • 2019 9th International Conference on Emerging Trends in Engineering and Technology - Signal and Information Processing (ICETET-SIP-19)
  • 2019
The objective is to manage documents and summarize a Hindi corpus by applying token extraction and document clustering, using TF-IDF, cosine-based document similarity measures and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities.
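
The TF-IDF weighting and cosine-based document similarity named in this summary can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the cited paper's implementation: it uses relative term frequency and a plain `log(N / df)` inverse document frequency, and the function names `tfidf_vectors` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per tokenized document.

    Assumption: tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        ["नदी", "पानी", "नाव"],
        ["नदी", "पानी", "मछली"],
        ["पहाड़", "बर्फ", "हवा"],
    ]
    vectors = tfidf_vectors(docs)
    print(cosine_similarity(vectors[0], vectors[1]))
    print(cosine_similarity(vectors[0], vectors[2]))
```

The pairwise similarities computed this way are the kind of input a hierarchical clustering step would use to draw a dendrogram. With this idf variant, a term that occurs in every document gets weight zero.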
Novel Language Resources for Hindi: An Aesthetics Text Corpus and a Comprehensive Stop Lemma List
This research emphasizes the use of stop lemmas instead of stop words. Stop word lists contain various, but not all, morphological forms of a word, whereas a stop lemma list contains only the root form of the word, from which variations could be derived if required.
Context Specific Lexicon for Hindi Reviews
Marathi Text Analysis using Unsupervised Learning and Word Cloud
  • Computer Science
  • International Journal of Engineering and Advanced Technology
  • 2020
Results prove the robustness of the proposed approach for a Marathi corpus, which applies TF-IDF, cosine-based document similarity measures and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities.
Large-Scale Analysis of Zipf’s Law in English Texts
This work studies three different versions of Zipf’s law by fitting them to all available English texts in the Project Gutenberg database and finds that one of them fits more than 40% of the texts in the database at the 0.05 significance level.
On Exhaustive Evaluation of Eager Machine Learning Algorithms for Classification of Hindi Verses
Text classification algorithms, along with Natural Language Processing (NLP), facilitate a fast, cost-effective, and scalable solution for the classification and prediction of verses in a Hindi corpus.
Empirical and Theoretical Bases of Zipf's Law
Let us start by considering a basic form of Zipf's law. Suppose one has a natural-language corpus, e.g., a book written in English. Next, suppose one makes a frequency count of the words in the
Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe
We present an update to UDPipe 1.0 (Straka et al., 2016), a trainable pipeline which performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. We provide
Latent semantic analysis for text categorization using neural network