Text Classification by Augmenting Bag of Words (BOW) Representation with Co-occurrence Feature

@article{SoumyaGeorge2014TextCB,
  title={Text Classification by Augmenting Bag of Words (BOW) Representation with Co-occurrence Feature},
  author={K SoumyaGeorge and Shibily Joseph},
  journal={IOSR Journal of Computer Engineering},
  year={2014},
  volume={16},
  pages={34-38}
}
Text classification is the task of assigning predefined categories to free-text documents based on their content. Traditional approaches use unigram-based models for text classification. Unigram-based models such as the Bag of Words (BOW) model do not consider the co-occurrence of sets of words at the document level. This paper proposes a way to extract co-occurrence features from the anchor text of Wikipedia pages and a way to incorporate these features into the BOW model. Finally the method is…
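The idea of augmenting a BOW vector with document-level co-occurrence features can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `CO:` feature prefix, the binary feature weight, and the hard-coded phrase list (standing in for phrases mined from Wikipedia anchor text) are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def cooccurrence_features(tokens, phrases):
    """Emit one binary feature per phrase whose words all occur in the document.

    `phrases` is a hypothetical stand-in for multi-word anchor texts;
    the words need not be adjacent, only present in the same document.
    """
    present = set(tokens)
    feats = Counter()
    for phrase in phrases:
        words = phrase.lower().split()
        if words and all(w in present for w in words):
            feats["CO:" + "_".join(words)] = 1  # assumed binary weight
    return feats


def augmented_bow(text, phrases):
    """Unigram counts plus the co-occurrence features in a single mapping."""
    tokens = tokenize(text)
    feats = Counter(tokens)  # plain BOW: unigram term frequencies
    feats.update(cooccurrence_features(tokens, phrases))
    return feats


if __name__ == "__main__":
    phrases = ["machine learning", "support vector machine"]
    doc = "support vector methods are a machine learning technique"
    for name, value in sorted(augmented_bow(doc, phrases).items()):
        print(name, value)
```

The resulting mapping can be passed to any classifier that accepts sparse feature dictionaries; how the co-occurrence features should be weighted relative to the unigram counts is left open here.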

A new term weighting scheme based on class specific document frequency for document representation and classification
TLDR
A new feature for document representation under the VSM framework, class specific document frequency (CSDF), is proposed; it leads to a novel term weighting scheme based on term frequency (TF), term presence (TP), and the newly proposed feature (CSDF and TF-CSDF).
Analysis and representation of Igbo text document for a text-based system
TLDR
The analysis of Igbo language text documents, considering the language's compounding nature and its representation with the word-based n-gram model, shows that bigram and trigram text representation models provide more semantic information and address the issues of compounding, word ordering and collocations, which are the major language peculiarities in Igbo.
Feature extraction and performance measure of requirement engineering (RE) document using text classification technique
  • L. P. Saikia, Shilpi Singh
  • Computer Science
    2018 4th International Conference on Recent Advances in Information Technology (RAIT)
  • 2018
TLDR
The main objective of this experimentation is to utilize semantic information to identify features and prepare data sets for better classification of text as “Ambiguous” or “Unambiguous”.
A Novel Feature Hashing With Efficient Collision Resolution for Bag-of-Words Representation of Text Data
TLDR
Using the vector data structure, the lookup performance is improved while resolving collision and the memory usage is also efficient.
Survey Paper on Feature Extraction Methods in Text Categorization
TLDR
In this paper, all the applied methods on feature extraction on text categorization from the traditional bag-of-words model approach to the unconventional neural networks are discussed.
Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity
TLDR
A comparative analysis of n-gram text representation on Igbo text document similarity, using the Euclidean similarity measure, shows that unigram-represented text has the highest distance values whereas bigram-represented text has the lowest corresponding distance values.
Application of Deep Learning Techniques on Document Classification
TLDR
A comparative study of some of the basic building blocks used in deep learning, each of which can be applied to build simpler models that assign a class to the available documents, shows how these components affect performance on the task.
A Comparison of Supervised Text Classification and Resampling Techniques for User Feedback in Bahasa Indonesia
TLDR
This paper aims to implement several numerical representations and resampling techniques (to handle imbalanced data), followed by an evaluation of some popular supervised machine learning classification algorithms: Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, and Decision Tree.
Network text sentiment analysis method combining LDA text representation and GRU-CNN
  • Li-xia Luo
  • Computer Science
    Personal and Ubiquitous Computing
  • 2018
TLDR
A text sentiment analysis method combining Latent Dirichlet Allocation text representation and convolutional neural network (CNN) that can effectively improve the accuracy of text sentiment classification.

References

Text Classification Using WordNet Hypernyms
TLDR
Experiments show that for some of the more difficult tasks the hypernym density representation leads to significantly more accurate and more comprehensible rules.
A class-feature-centroid classifier for text categorization
TLDR
A fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization that consistently outperforms the state-of-the-art SVM classifiers on both micro-F1 and macro-F1 scores.
Text categorization for multi-page documents: a hybrid naive Bayes HMM approach
TLDR
A method for classifying pages of sequential OCR text documents into one of several assigned categories is described and it is suggested that taking into account contextual information provided by the whole page sequence can significantly improve classification accuracy.
A sequential algorithm for training text classifiers
TLDR
An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task and reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
Machine learning in automated text categorization
TLDR
This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
A statistical learning model of text classification for support vector machines
TLDR
This model explains why and when SVMs perform well for text classification and connects the statistical properties of text-classification tasks with the generalization performance of a SVM in a quantitative way.
Taming Wild Phrases
TLDR
It is concluded that even the most careful term selection cannot overcome the differences in Document Frequency between phrases and words, and the use of term clustering to make phrases more cooperative is proposed.
Wikipedia in Action: Ontological Knowledge in Text Categorization
TLDR
A new, ontology-based approach to the automatic text categorization that does not require a training set, which is in contrast to the traditional statistical and probabilistic methods.
Journal Papers
TLDR
This paper presents a base model combination algorithm for resolving tied predictions for k-nearest neighbor aggregate models and a comparative study of sample selection methods for classification.
Approximate nearest neighbors: towards removing the curse of dimensionality
TLDR
Two algorithms for the approximate nearest neighbor problem in high-dimensional spaces are presented, which require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d.