Classifying Scientific Publications Using Abstract Features
@inproceedings{Caragea2011ClassifyingSP, title={Classifying Scientific Publications Using Abstract Features}, author={Cornelia Caragea and Adrian Silvescu and Saurabh Kataria and Doina Caragea and Prasenjit Mitra}, booktitle={Symposium on Abstraction, Reformulation and Approximation}, year={2011} }
With the exponential increase in the number of documents available online, e.g., news articles, weblogs, scientific documents, effective and efficient classification methods are required in order to deliver the appropriate information to specific users or groups. The performance of document classifiers critically depends, among other things, on the choice of the feature representation. The commonly used "bag of words" representation can result in a large number of features. Feature abstraction…
16 Citations
Document Type Classification in Online Digital Libraries
- Computer Science, AAAI
- 2016
This work proposes novel features that result in high-accuracy classifiers for document type classification and shows that these classifiers outperform models that are employed in current systems.
Automated Identification of Computer Science Research Papers
- Computer Science
- 2016
With a large training set, bigram modeling with normalized feature weights performs best on both data sets. Surprisingly, the arXiv data set can be classified with an F1 of up to 0.95, while CiteSeerX reaches a lower F1 (0.764).
Classifying Computer Science Papers
- Computer Science
- 2016
Computer science papers can be identified with high accuracy (F1 close to 0.95); the best method is a bigram model with a Multinomial Naive Bayes classifier and point-wise mutual information (PMI) for feature selection.
Co-Training for Topic Classification of Scholarly Data
- Computer Science, EMNLP
- 2015
A co-training approach that uses the text and citation information of a research article as two different views to predict the topic of an article is described, showing that this method improves significantly over the individual classifiers, while also bringing a substantial reduction in the amount of labeled data required for training accurate classifiers.
Identifying Documents In-Scope of a Collection from Web Archives
- Computer Science, JCDL
- 2020
This paper studies both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents.
CiteSeerX: A Scholarly Big Dataset
- Computer Science, ECIR
- 2014
This work proposes an approach to CiteSeerX metadata cleaning that incorporates information from an external data source, which is substantially cleaner than the entire set of data, and makes the new dataset available to the research community.
Dynamic Classification in Web Archiving Collections
- Computer Science, LREC
- 2020
This paper explores dynamic fusion models to find, on the fly, the model or combination of models that performs best on a variety of document types and shows that the approach that fuses different models outperforms individual models and other ensemble methods on three datasets.
An Intelligent Opinion Mining for Customer Reviews
- Computer Science
- 2015
The proposed system gives new users an enhanced summarized result to support fast decisions, and extends feature-based opinion classification to positive and negative opinions.
Improving Sentiment Analysis in an Online Cancer Survivor Community Using Dynamic Sentiment Lexicon
- Computer Science, 2013 International Conference on Social Intelligence and Technology
- 2013
This work presents an automated approach for sentiment analysis in an online cancer survivor community and compares it with a previous sentiment analysis approach, both of which are machine learning based and are tested on the same dataset.
References
The Power of Word Clusters for Text Classification
- Computer Science
- 2006
This work applies the information bottleneck method to find word clusters that preserve the information about document categories and uses these clusters as features for classification. It shows that when the training sample is small, word clusters can yield a significant improvement in classification accuracy.
Distributional Word Clusters vs. Words for Text Categorization
- Computer Science, J. Mach. Learn. Res.
- 2003
An approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier with a word-cluster representation is studied, which significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency.
LDA-based document models for ad-hoc retrieval
- Computer Science, SIGIR
- 2006
This paper proposes an LDA-based document model within the language modeling framework, and evaluates it on several TREC collections, and shows that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
An Introduction to Variable and Feature Selection
- Computer Science, J. Mach. Learn. Res.
- 2003
The contributions of this special issue cover a wide range of aspects of variable selection: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Indexing by Latent Semantic Analysis
- Computer Science, J. Am. Soc. Inf. Sci.
- 1990
A new method for automatic indexing and retrieval is described that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Multinomial Event Model Based Abstraction for Sequence and Text Classification
- Computer Science, SARA
- 2005
Experimental results on protein localization sequences and Reuters text show that the proposed algorithms can generate Naive Bayes classifiers that are more compact and often more accurate than those produced by the standard Naive Bayes learner for the Multinomial Model.
CiteSeer: an automatic citation indexing system
- Computer Science, DL '98
- 1998
CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations.
Combining Super-Structuring and Abstraction on Sequence Classification
- Computer Science, 2009 Ninth IEEE International Conference on Data Mining
- 2009
The results of the experiments show that adapting data representation by combining super-structuring and abstraction makes it possible to construct predictive models that use a significantly smaller number of features than those obtained using super-structuring alone, without sacrificing predictive accuracy.
Probabilistic Latent Semantic Analysis
- Computer Science, UAI
- 1999
This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model which results in a more principled approach which has a solid foundation in statistics.