Corpus ID: 17087045

Document Classification for Newspaper Articles

Dennis Ramdass and Shreyes Seshasai
In many real-world scenarios, the ability to automatically classify documents into a fixed set of categories is highly desirable. Common scenarios include classifying large numbers of unclassified archival documents such as newspaper articles, legal records and academic papers. For example, newspaper articles can be classified as 'features', 'sports' or 'news'. Other scenarios involve classifying documents as they are created. Examples include classifying movie review articles into…

Classification of Indonesian news articles based on Latent Dirichlet Allocation

The massive number of news articles poses a potential problem for automatic classification. Classification of English news articles has been widely studied. However, it is

Effect of imbalanced data on document classification algorithms

A significant finding of the research was that the algorithms performed similarly, or in some cases even better, on imbalanced data compared to balanced data, which attests to the resilience of probability-based algorithms for text categorization.

Urdu Text Classification using Majority Voting

Five well-known classification techniques are applied to an Urdu-language corpus, and a class is assigned to each document by majority voting, achieving up to 94% precision and recall.

A Survey on text categorization of Indian and non-Indian languages using supervised learning techniques

This paper presents a survey of text categorization of Indian and non-Indian languages; the measures used to evaluate text categorization performance are recall, precision and F-measure.

Document Classification Method Based on Contents Using an Improved Multinomial Naïve Bayes Model

This research aimed to improve the performance of multinomial naive Bayes (MNB) classification using three different approaches: first by adding n-grams only, second by applying TF-IDF, and third by using both n-grams and TF-IDF.

Supervised methods for domain classification of tamil documents

The Dinakaran newspaper dataset from the EMILLE/CIIL Corpus is used to test the ability of machine learning algorithms in Tamil domain classification.

BCC NEWS Classification Comparison between Naïve Bayes, Support Vector Machine, Recurrent Neural Network

Various classification models, such as Naïve Bayes, Support Vector Machine, and logistic regression, are compared to identify the one that gives the most accurate results in news classification.

News Classification using Neural Networks

This paper presents a system for the classification of news articles based on artificial neural networks and compares the results with previously used classification techniques.

Automatic classification of older electronic texts into the Universal Decimal Classification-UDC

A model for automated classification of old digitised texts into the Universal Decimal Classification (UDC) is developed using machine-learning methods, and the results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text.

Semantic Similarity Based Classification Of Narratives

Story retrieval relies mainly on identifying the protagonist, by which narratives can be classified as male-based or female-based, and comprehensibility can be determined by the algorithm with improved accuracy.



A comparison of event models for naive bayes text classification

It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.

Text Classification from Labeled and Unlabeled Documents using EM

This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.

An Evaluation of Statistical Approaches to Text Categorization

Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature.

Context-sensitive learning methods for text categorization

RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods; these results are viewed as a confirmation of the usefulness of classifiers that represent contextual information.

Unsupervised Multilingual Sentence Boundary Detection

A language-independent, unsupervised approach to sentence boundary detection based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified, which is able to detect abbreviations with high accuracy.

Using Maximum Entropy for Text Classification

This paper uses maximum entropy techniques for text classification, estimating the conditional distribution of the class variable given the document. It compares accuracy to naive Bayes and shows that maximum entropy is sometimes significantly better, but also sometimes worse.

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are

Combining Statistical and Relational Methods for Learning in Hypertext Domains

This work presents a new approach to learning hypertext classifiers that combines a statistical text-learning method with a relational rule learner and demonstrates that this new approach is able to learn more accurate classifiers than either of its constituent methods alone.

BoosTexter: A Boosting-based System for Text Categorization

This work describes in detail an implementation, called BoosTexter, of the new boosting algorithms for text categorization tasks, and presents results comparing the performance of BoosTexter and a number of other text-categorization algorithms on a variety of tasks.

A Statistical Model for Parsing and Word-Sense Disambiguation

A first attempt at a statistical model for simultaneous syntactic parsing and generalized word-sense disambiguation is described, which achieves a recall of 84.0% and a precision of 67.3% on a new data set constructed for the task.