Marathi Document: Similarity Measurement using Semantics-based Dimension Reduction Technique

  • Prafulla Bharat Bafna, Jatinderkumar R. Saini
  • Published 2020
  • Computer Science
  • International Journal of Advanced Computer Science and Applications
Textual data is increasing exponentially, and different techniques are being researched to extract the required information from text. Some of these techniques require the data to be presented in a tabular or matrix format. The proposed approach designs a Document Term Matrix for a Marathi (DTMM) corpus, converting unstructured data into tabular form. DTMM, however, fails to consider the semantics of the terms. We propose another approach that… 
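
The document-term-matrix construction the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' Marathi pipeline: it assumes simple whitespace tokenization and uses toy English tokens.

```python
from collections import Counter

def build_dtm(docs):
    """Build a document-term matrix: rows are documents, columns are
    the sorted corpus vocabulary, cells are raw term counts."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts.get(term, 0) for term in vocab])
    return vocab, matrix

docs = ["the cat sat", "the dog sat on the mat"]
vocab, dtm = build_dtm(docs)
# Each unstructured document is now a row of term counts over a
# shared vocabulary, i.e. tabular data ready for matrix techniques.
```

A synset-based variant (as the paper's semantic approach suggests) would map each token to a synonym-set identifier before counting, so that synonymous terms share a column.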
2 Citations


Measuring the Similarity between the Sanskrit Documents using the Context of the Corpus

The proposed approach processes Sanskrit, one of the oldest and most morphologically complex languages, and builds a Document Term Matrix for Sanskrit (DTMS) and a Document Synset Matrix for Sanskrit (DSMS) to solve the problem of polysemy.

Sensed-Lexicon based Approach for Identification of Similarity among Punjabi Documents

Results revealed that, on the basis of majority voting, the combination of stop-word removal, stemming, and noun-based synonym replacement with bi-gram tokens performs best.

References

Marathi Text Analysis using Unsupervised Learning and Word Cloud

  • Computer Science
    International Journal of Engineering and Advanced Technology
  • 2020
Results prove the robustness of the proposed approach for a Marathi corpus, applying TF-IDF, cosine-based document similarity measures, and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities.
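
The TF-IDF weighting and cosine-based document similarity mentioned in this and several other entries can be sketched as below. This is an illustrative, self-contained version with toy English documents and a common smoothed IDF formula, not the cited papers' exact implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight raw term counts by inverse document frequency
    (idf = log(N/df) + 1, one common smoothing choice)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    return [[Counter(toks)[t] * idf[t] for t in vocab] for toks in tokenized]

def cosine(u, v):
    """Cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["nlp text mining", "nlp text analysis", "stock market prices"]
vecs = tfidf_vectors(docs)
# Documents sharing terms score higher; disjoint documents score 0.
```

A cluster dendrogram, as used in these papers, would then feed the pairwise cosine distances (1 - cosine) into hierarchical agglomerative clustering.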

Hindi Multi-document Word Cloud based Summarization through Unsupervised Learning

  • P. Bafna, Jatinderkumar R. Saini
  • Computer Science
    2019 9th International Conference on Emerging Trends in Engineering and Technology - Signal and Information Processing (ICETET-SIP-19)
  • 2019
The objective is to manage documents and summarize a Hindi corpus by extracting tokens and clustering documents, applying TF-IDF, cosine-based document similarity measures, and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities.

Information Retrieval for Gujarati Language Using Cosine Similarity Based Vector Space Model

This is the first IR task in the Gujarati language using cosine-similarity-based calculations with the Vector Space Document Model (VSDM), widely used in information retrieval and document classification, where each document is represented as a vector and each dimension corresponds to a separate term.

Text Summarization in Indian Languages: A Critical Review

Challenges that researchers face while summarizing text in Indian languages are emphasized, and remedies are proposed to derive more accurate results.

Application of Latent Semantic Indexing for Hindi-English CLIR Irrespective of Context Similarity

It is shown that LSI-based CLIR outperforms non-LSI-based retrieval, with retrieval success rates of 67% and 9% respectively.
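
Latent Semantic Indexing, as used in this entry, factorizes a term-document matrix with SVD and keeps only the top singular directions, so documents sharing latent topics become similar even without exact term overlap. A minimal sketch with a classic toy matrix (hypothetical terms, not data from the cited paper):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([
    [1, 1, 0, 0],   # "ship"
    [1, 0, 0, 0],   # "boat"
    [0, 1, 1, 0],   # "ocean"
    [0, 0, 1, 1],   # "tree"
    [0, 0, 0, 1],   # "wood"
], dtype=float)

# Truncated SVD: keep the top-k singular directions (the latent topics).
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T   # documents in k-dim latent space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In the reduced space, the two nautical documents (columns 0 and 1) end up far closer to each other than to the wood-themed document (column 3), even though columns 0 and 3 share no terms at all.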

An Application of Zipf's Law for Prose and Verse Corpora Neutrality for Hindi and Marathi Languages

Common tokens from corpora of Marathi and Hindi verse and prose are identified to show that both behave the same as far as NLP activities are concerned, and BaSa is shown to improve on Zipf's law.

Survey of Progressive Era of Text Summarization for Indian and Foreign Languages Using Natural Language Processing

Text summarization deals with saving time by giving the user output with minimal text, without changing its meaning.

Similar Meaning Analysis for Original Documents Identification in Arabic Language

This work constructed a corpus for Arabic and studied how it could be used to evaluate Natural Language Processing (NLP) methods, namely Term Frequency-Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), word2vec, Global Vector Representation (GloVe), and Convolutional Neural Networks (CNN), for paraphrase detection.

Reduction of Dimensionality of Feature Vectors in Subject Classification of Text Documents

Investigation of the influence of dimensionality reduction of feature vectors (PCA and random projection) on subject classification of text documents in Polish shows that PCA gives better accuracy in all analyzed cases.
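
The PCA reduction this entry evaluates projects each feature vector onto the directions of greatest variance. A minimal SVD-based sketch on a toy feature matrix (illustrative values, not the Polish-document data):

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X onto the top-k principal components
    (directions of maximum variance of the centered data)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Toy 3-D feature vectors that really vary along ~1 dominant direction.
X = np.array([[2.0, 0.0, 1.0],
              [4.0, 0.1, 2.0],
              [6.0, 0.0, 3.0],
              [8.0, 0.1, 4.0]])
Z = pca_reduce(X, 2)
# Z keeps almost all the variance in its first column; a classifier
# trained on Z sees far fewer, less noisy dimensions.
```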

Latent Semantic Kernels for WordNet: Transforming a Tree-Like Structure into a Matrix

  • Young-Bum Kim, Yu-Seop Kim
  • Computer Science
    2008 International Conference on Advanced Language Processing and Web Information Technology
  • 2008
A matrix representing the WordNet hierarchical structure is proposed, which represents each term as a vector whose elements correspond to WordNet synsets and reduces the dimensionality of the vector to capture the latent semantic structure.
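
The tree-to-matrix idea in this entry can be sketched as follows: encode each term as a binary vector over synsets (marking its own synset and all ancestors in the IS-A hierarchy), then compress the resulting matrix with truncated SVD. The tiny hierarchy below is hypothetical, standing in for WordNet; it is not the authors' kernel construction.

```python
import numpy as np

# Hypothetical IS-A hierarchy standing in for WordNet synsets.
parents = {
    "dog": "canine", "wolf": "canine", "canine": "animal",
    "cat": "feline", "feline": "animal", "animal": "entity",
}
synsets = sorted(set(parents) | set(parents.values()))

def term_vector(term):
    """Binary vector over synsets: 1 for the term's synset and all ancestors."""
    vec = np.zeros(len(synsets))
    node = term
    while node is not None:
        vec[synsets.index(node)] = 1.0
        node = parents.get(node)
    return vec

terms = ["dog", "wolf", "cat"]
M = np.stack([term_vector(t) for t in terms])
# Truncated SVD compresses the tree-derived vectors into a dense
# low-dimensional space that preserves the hierarchy's similarities.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
reduced = U[:, :2] * S[:2]
```

Sibling terms ("dog" and "wolf") share more ancestors than cousins ("dog" and "cat"), and that ordering survives the reduction.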