Corpus ID: 17087045

Document Classification for Newspaper Articles

@inproceedings{Ramdass2012DocumentCF,
  title={Document Classification for Newspaper Articles},
  author={Dennis Ramdass and S. Seshasai},
  year={2012}
}
In many real-world scenarios, the ability to automatically classify documents into a fixed set of categories is highly desirable. Common scenarios include classifying large collections of unclassified archival documents such as newspaper articles, legal records and academic papers. For example, newspaper articles can be classified as 'features', 'sports' or 'news'. Other scenarios involve classifying documents as they are created. Examples include classifying movie review articles into…

Classification of Indonesian news articles based on Latent Dirichlet Allocation
A massive number of news articles leads to a potential problem in the automatic classification task. The classification of English news articles has been widely studied. However, it is…
Effect of imbalanced data on document classification algorithms
A significant finding from the research was that the algorithms performed similarly, or in some cases even better, on imbalanced data compared to balanced data, which supports the resilience of probability-based algorithms for text categorization.
Urdu Text Classification using Majority Voting
Five well-known classification techniques are applied to an Urdu-language corpus, and each document is assigned a class by majority voting, achieving up to 94% precision and recall.
A Survey on text categorization of Indian and non-Indian languages using supervised learning techniques
Categorization of text plays an important role in the text mining field. Text categorization is the process in which documents are assigned to predefined categories. Automatic text…
Supervised methods for domain classification of Tamil documents
The Dinakaran newspaper dataset from the EMILLE/CIIL Corpus is used to test the ability of machine learning algorithms in Tamil domain classification.
BCC NEWS Classification Comparison between Naïve Bayes, Support Vector Machine, Recurrent Neural Network
Data classification is used to determine the category to which data belongs. In the present era, due to technology and information, news is easily accessible through online sources. In day to day…
News Classification using Neural Networks
This paper presents a system for the classification of news articles based on artificial neural networks and compares the results with previously used classification techniques.
Automatic classification of older electronic texts into the Universal Decimal Classification-UDC
A model for automated classification of old digitised texts to the Universal Decimal Classification (UDC) is developed, using machine-learning methods, and it suggests that machine-learning models can correctly assign the UDC at some level for almost any scholarly text.
Semantic Similarity Based Classification Of Narratives
Semantic similarity between narratives is found by constructing a semantic network that defines semantic relatedness, which can then be used to classify the document. The story retrieval is…
The New Incinerator in Parma and the News from Newspapers - The Importance of Communication in Terms of "Environment and Health"
The objective is to evaluate news concerning the incinerator in Parma and to assess any potential information gap that could be addressed by institutional communication. Articles from both online…
