Document categorization using semantic relatedness & Anaphora resolution: A discussion

  • Kaustubh D. Dhole, Harsh Kohli
  • Published 1 November 2015
  • Computer Science
  • 2015 IEEE International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN)
Document categorization is the process of assigning pre-defined categories to textual documents. State-of-the-art approaches have modelled documents as corpus-length vectors and viewed the problem only from a syntactic perspective. We develop a general measure to estimate the semantic closeness of documents by utilizing the semantic relatedness of the most discriminative individual words that define the document. Anaphora resolution is used to strengthen the meaning ascribed to…
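The abstract is truncated, so the paper's actual measure is not visible here. As a rough illustration of the general idea only (scoring document closeness from the relatedness of each document's most discriminative words), the sketch below makes several assumptions that are not taken from the paper: TF-IDF is used to pick the discriminative words, a small hand-written table (TOY_RELATEDNESS) stands in for a real word-relatedness measure, and the per-word scores are combined by a symmetric best-match average. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy relatedness table; a real system would query a lexical
# resource or embedding model instead. Values are made up for illustration.
TOY_RELATEDNESS = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "engine")): 0.5,
    frozenset(("automobile", "engine")): 0.5,
}


def word_relatedness(a, b):
    """Return a relatedness score in [0, 1]; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_RELATEDNESS.get(frozenset((a, b)), 0.0)


def top_discriminative_words(doc_tokens, corpus, k=3):
    """Pick the k highest TF-IDF words of a document.

    Assumption: TF-IDF is only a stand-in for "most discriminative";
    the source does not say how the words are selected.
    """
    n_docs = len(corpus)
    tf = Counter(doc_tokens)

    def tfidf(word):
        df = sum(1 for d in corpus if word in d)
        return tf[word] * math.log((1 + n_docs) / (1 + df))

    # Sort by descending score, breaking ties alphabetically.
    return sorted(tf, key=lambda w: (-tfidf(w), w))[:k]


def document_closeness(words_a, words_b, rel=word_relatedness):
    """Symmetric best-match average of word relatedness (assumed combination)."""
    if not words_a or not words_b:
        return 0.0

    def directed(src, dst):
        # For each source word, take its best match in the other document.
        return sum(max(rel(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(words_a, words_b) + directed(words_b, words_a))


corpus = [["car", "engine", "the"], ["automobile", "engine", "the"]]
words_a = top_discriminative_words(corpus[0], corpus, k=2)
words_b = top_discriminative_words(corpus[1], corpus, k=2)
closeness = document_closeness(words_a, words_b)
```

In this toy setup the two documents share no distinctive surface word other than "engine", yet they still score as close because "car" and "automobile" are marked as related. That is the kind of effect a purely syntactic vector comparison would miss.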

Impact of Anaphora Resolution on Opinion Target Identification
This study empirically evaluated the impact of anaphora resolution using benchmark datasets, reporting a precision of 88.14, a recall of 71.45, and an F-score of 72.12.
Innovations of Phishing Defense: The Mechanism, Measurement and Defense Strategies
A hybrid multi-layer model using Natural Language Processing (NLP) techniques is proposed for defending against phishing attacks, opening a new avenue for detecting a potential attacker who tries to manipulate the victim into revealing confidential information.


Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy
This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the
Similarity Measures for Text Document Clustering
A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy; these measures are compared and analyzed in partitional clustering for text document datasets.
Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language
  • P. Resnik
  • Computer Science
  • J. Artif. Intell. Res.
  • 1999
This article presents a measure of semantic similarity in an IS-A taxonomy based on the notion of shared information content that performs better than the traditional edge-counting approach.
Understanding the Semantic Structure of Noun Phrase Queries
  • Xiao Li
  • Computer Science, Linguistics
  • 2010
This work formally defines the semantic structure of noun phrase queries as comprised of intent heads and intent modifiers and presents methods that automatically identify these constituents as well as their semantic roles based on Markov and semi-Markov conditional random fields.
A Language Modeling Approach to Information Retrieval
This work proposes an approach to retrieval based on probabilistic language modeling and integrates document indexing and document retrieval into a single model, which significantly outperforms standard tf.idf weighting on two different collections and query sets.
Using Information Content to Evaluate Semantic Similarity in a Taxonomy
This paper presents a new measure of semantic similarity in an IS-A taxonomy, based on the notion of information content, which performs encouragingly well and is significantly better than the traditional edge counting approach.
Machine learning in automated text categorization
This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Sentiment analysis of blogs by combining lexical knowledge with text classification
This paper presents a unified framework in which one can use background lexical information in terms of word-class associations, and refine this information for specific domains using any available training examples, and shows that this approach performs better than using background knowledge or training data in isolation.
A Review of Semantic Similarity Measures in WordNet
The paper reviews state-of-the-art measures, including path-based, information-based, feature-based, and hybrid measures, and describes areas for future research.
Little words can make a big difference for text classification
This work presents results from text classification experiments that compare relevancy signatures, which use local linguistic context, with corresponding indexing terms that do not, and suggests that stopword lists and stemming algorithms may remove or conflate many words that could be used to create more effective indexing terms.