Corpus ID: 14767303

Classifying Scientific Publications Using Abstract Features

Authors: Cornelia Caragea, Adrian Silvescu, Saurabh Kataria, Doina Caragea, Prasenjit Mitra
Published in: Symposium on Abstraction, Reformulation and Approximation
With the exponential increase in the number of documents available online, e.g., news articles, weblogs, scientific documents, effective and efficient classification methods are required in order to deliver the appropriate information to specific users or groups. The performance of document classifiers critically depends, among other things, on the choice of the feature representation. The commonly used "bag of words" representation can result in a large number of features. Feature abstraction… 
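
The point about feature counts can be illustrated with a minimal sketch, which is not the authors' code. It builds a "bag of words" representation with one feature per distinct word; the helper name `bag_of_words` and the three toy documents are assumptions made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts (lowercased, alphabetic tokens only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Toy documents, invented for this example.
docs = [
    "Feature selection reduces the number of features.",
    "Topic models map words to a small number of topics.",
    "Word clusters group similar words together.",
]

bags = [bag_of_words(d) for d in docs]

# One feature per distinct word, so the feature count grows with the vocabulary.
vocabulary = sorted(set().union(*bags))

# Each document becomes a count vector with one entry per vocabulary word.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(len(vocabulary), "features for", len(docs), "documents")
```

Even three short sentences yield far more features than documents, which is why real collections motivate methods that reduce the feature space.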

Document Type Classification in Online Digital Libraries

This work proposes novel features that result in high-accuracy classifiers for document type classification and shows that these classifiers outperform models that are employed in current systems.

Automated Identification of Computer Science Research Papers

With a large training set, bigram modeling with normalized feature weights performs best on both data sets; surprisingly, the arXiv data set can be classified with an F1 of up to 0.95, while CiteSeerX reaches a lower F1 (0.764).

Classifying Computer Science Papers

It is found that computer science papers can be identified with high accuracy (F1 close to 0.95), and that the best method is a bigram model using Multinomial Naive Bayes with point-wise mutual information (PMI) for feature selection.
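
As a hedged illustration of PMI-based feature scoring, and not the method as implemented in that paper, the sketch below computes PMI between a term's presence and a target class from document counts, using PMI = log(P(term, class) / (P(term) P(class))). The function name `pmi_scores`, the toy tokens, and the labels are all hypothetical.

```python
import math
from collections import Counter


def pmi_scores(docs, labels, target):
    """PMI between each term's presence and the target class.

    docs is a list of token lists; probabilities are estimated from
    document counts. Terms that never occur in the target class are
    omitted, since their PMI would be minus infinity.
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df = Counter()         # number of documents containing the term
    df_target = Counter()  # number of target-class documents containing it
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            if y == target:
                df_target[term] += 1
    scores = {}
    for term, joint in df_target.items():
        # log( P(term, class) / (P(term) * P(class)) )
        scores[term] = math.log((joint / n) / ((df[term] / n) * (n_target / n)))
    return scores


# Toy data, invented for this example.
docs = [
    ["neural", "network"],
    ["neural", "proof"],
    ["protein", "network"],
    ["protein", "fold"],
]
labels = ["cs", "cs", "bio", "bio"]

scores = pmi_scores(docs, labels, target="cs")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Terms with higher scores are more strongly associated with the target class; a feature selector would keep the top-ranked terms and pass only those to the classifier.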

Co-Training for Topic Classification of Scholarly Data

A co-training approach that uses the text and citation information of a research article as two different views to predict the topic of an article is described, showing that this method improves significantly over the individual classifiers, while also bringing a substantial reduction in the amount of labeled data required for training accurate classifiers.

Identifying Documents In-Scope of a Collection from Web Archives

This paper studies both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, as well as structural features that capture the structure of documents.

CiteSeerX: A Scholarly Big Dataset

This work proposes an approach to CiteSeerX metadata cleaning that incorporates information from an external data source, which is substantially cleaner than the entire set of data, and makes the new dataset available to the research community.

Dynamic Classification in Web Archiving Collections

This paper explores dynamic fusion models to find, on the fly, the model or combination of models that performs best on a variety of document types and shows that the approach that fuses different models outperforms individual models and other ensemble methods on three datasets.

An Intelligent Opinion Mining for Customer Reviews

The proposed system gives new users an enhanced summarized result to support fast decisions, and extends feature-based opinion classification to positive and negative opinions.

Improving Sentiment Analysis in an Online Cancer Survivor Community Using Dynamic Sentiment Lexicon

This work presents an automated approach for sentiment analysis in an online cancer survivor community and compares it with a previous sentiment analysis approach, both of which are machine learning based and are tested on the same dataset.



The Power of Word Clusters for Text Classification

This work applies the information bottleneck method to find word clusters that preserve the information about document categories, uses these clusters as features for classification, and shows that when the training sample is small, word clusters can yield significant improvement in classification accuracy.

Distributional Word Clusters vs. Words for Text Categorization

An approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier with a word-cluster representation is studied, which significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency.

LDA-based document models for ad-hoc retrieval

This paper proposes an LDA-based document model within the language modeling framework, evaluates it on several TREC collections, and shows that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.

An Introduction to Variable and Feature Selection

The contributions of this special issue cover a wide range of aspects of variable selection: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

Indexing by Latent Semantic Analysis

A new method for automatic indexing and retrieval is described that takes advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries.

Multinomial Event Model Based Abstraction for Sequence and Text Classification

Experimental results on protein localization sequences and Reuters text show that the proposed algorithms can generate Naive Bayes classifiers that are more compact and often more accurate than those produced by the standard Naive Bayes learner for the Multinomial Model.

CiteSeer: an automatic citation indexing system

CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations.

Latent Dirichlet Allocation

Combining Super-Structuring and Abstraction on Sequence Classification

The results of the experiments show that adapting data representation by combining super-structuring and abstraction makes it possible to construct predictive models that use a significantly smaller number of features than those obtained using super-structuring alone, without sacrificing predictive accuracy.

Probabilistic Latent Semantic Analysis

This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model, which results in a more principled approach with a solid foundation in statistics.